1
Thind AS, Sinha S. Using Chaos-Game-Representation for Analysing the SARS-CoV-2 Lineages, Newly Emerging Strains and Recombinants. Curr Genomics 2023; 24:187-195. [PMID: 38178984 PMCID: PMC10761335 DOI: 10.2174/0113892029264990231013112156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 08/09/2023] [Accepted: 09/15/2023] [Indexed: 01/06/2024] Open
Abstract
Background: Viruses have high mutation rates, facilitating rapid evolution and the emergence of new species, subspecies, strains and recombinant forms. Accurate classification of these forms is crucial for understanding viral evolution and developing therapeutic applications. Phylogenetic classification is typically performed by analyzing molecular differences at the genomic and sub-genomic levels. This involves aligning homologous proteins or genes. However, there is growing interest in developing alignment-free methods for whole-genome comparisons that are computationally efficient. Methods: Here we elaborate on the Chaos Game Representation (CGR) method, based on concepts of statistical physics and free of sequence alignment assumptions. We adopt the CGR method for classification of the closely related clades/lineages A and B of the SARS-Corona virus 2019 (SARS-CoV-2), which is one of the fastest evolving viruses. Results: Our study shows that the CGR approach can easily yield the SARS-CoV-2 phylogeny from the available whole genomes of lineage A and lineage B sequences. It also shows an accurate classification of eight different strains and the newly evolved XBB variant from its parental strains. Compared to alignment-based methods (Neighbour-Joining and Maximum Likelihood), the CGR method requires low computational resources, is fast and accurate for long sequences, and, being a k-mer based approach, allows simultaneous comparison of a large number of closely related sequences of different sizes. Further, we developed an R pipeline, CGRphylo, available on GitHub, which integrates the CGR module with various other R packages to create phylogenetic trees and visualize them. Conclusion: Our findings demonstrate the efficacy of the CGR method for accurate classification and tracking of rapidly evolving viruses, offering valuable insights into the evolution and emergence of new SARS-CoV-2 strains and recombinants.
Affiliation(s)
- Amarinder Singh Thind
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
  - Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Australia
- Somdatta Sinha
  - Department of Biological Sciences, Indian Institute of Science Education & Research, Mohali, India
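The Chaos Game Representation mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the CGRphylo pipeline (which is written in R); the corner assignment, the function names, and the use of a Euclidean distance between k-mer frequency vectors are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
import math

# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic order.
    Each k-mer corresponds to one cell of a 2**k x 2**k CGR grid."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors
    (an assumed choice of distance, not taken from the paper)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Because the frequency vectors have a fixed length of 4**k regardless of genome length, sequences of different sizes can be compared pairwise without alignment, and the resulting distance matrix can be passed to a tree-building method.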
2
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time- and memory-consuming. As the scale of sequencing data is increasing day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationships among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
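To make the idea of a cosine distance on minimal absent word sets (the CD-MAWS entry above) more concrete, here is a small Python sketch. The abstract does not say how the composition vectors are built, so the binary indicator vectors used here are an assumption, and the brute-force enumeration is only practical for short words; both function names are hypothetical.

```python
from itertools import product
import math


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force minimal absent words of length <= max_len.
    A word w is a MAW if w does not occur in seq but both w[1:] and
    w[:-1] do (the empty word counts as occurring)."""
    present = {
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + 1, min(len(seq), i + max_len) + 1)
    }
    present.add("")
    maws = set()
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if w not in present and w[1:] in present and w[:-1] in present:
                maws.add(w)
    return maws


def cosine_distance_sets(a, b):
    """Cosine distance between two sets viewed as binary indicator
    vectors (assumed encoding): 1 - |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))
```

A typical use would compute `minimal_absent_words` for each genome, then fill a pairwise matrix with `cosine_distance_sets` and hand it to a distance-based tree builder.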
3
Sun N, Yau SST. In-depth investigation of the point mutation pattern of HIV-1. Front Cell Infect Microbiol 2022; 12:1033481. [PMID: 36457853 PMCID: PMC9705751 DOI: 10.3389/fcimb.2022.1033481] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/31/2022] [Accepted: 10/25/2022] [Indexed: 04/29/2024] Open
Abstract
Mutations may produce highly transmissible and damaging HIV variants, which increase genetic diversity and pose a challenge to vaccine development. Therefore, it is of great significance to understand how mutations drive the virulence of HIV. Based on the 11897 reliable genomes of HIV-1 retrieved from the HIV sequence Database, we analyze the 12 types of point mutation (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) from multiple statistical perspectives for the first time. The global/geographical location/subtype/k-mer analysis results report that A>G, G>A, C>T and T>C account for nearly 64% of all SNPs, which suggests that APOBEC-editing and ADAR-editing may play an important role in HIV-1 infectivity. Time analysis shows that most genomes with abnormal mutation numbers come from African countries. Finally, we use the natural vector method to check the k-mer distribution changing patterns in the genome, and find that there is an important substitution pattern between nucleotides A and G, and that the 2-mer CG may have a significant impact on viral infectivity. This paper provides an insight into the single mutation of HIV-1 by using the latest data in the HIV sequence Database.
Affiliation(s)
- Nan Sun
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China
4
Jiao X, Pei S, Sun Z, Kang J, Yau SST. Determination of the nucleotide or amino acid composition of genome or protein sequences by using natural vector method and convex hull principle. Fundamental Research 2021. [DOI: 10.1016/j.fmre.2021.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open
5
Literman R, Schwartz R. Genome-Scale Profiling Reveals Noncoding Loci Carry Higher Proportions of Concordant Data. Mol Biol Evol 2021; 38:2306-2318. [PMID: 33528497 PMCID: PMC8136493 DOI: 10.1093/molbev/msab026] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022] Open
Abstract
Many evolutionary relationships remain controversial despite whole-genome sequencing data. These controversies arise, in part, due to challenges associated with accurately modeling the complex phylogenetic signal coming from genomic regions experiencing distinct evolutionary forces. Here, we examine how different regions of the genome support or contradict well-established relationships among three mammal groups using millions of orthologous parsimony-informative biallelic sites (PIBS) distributed across primate, rodent, and Pecora genomes. We compared PIBS concordance percentages among locus types (e.g. coding sequences (CDS), introns, intergenic regions), and contrasted PIBS utility over evolutionary timescales. Sites derived from noncoding sequences provided more data and proportionally more concordant sites compared with those from CDS in all clades. CDS PIBS were also predominant drivers of tree incongruence in two cases of topological conflict. PIBS derived from most locus types provided surprisingly consistent support for splitting events spread across the timescales we examined, although we find evidence that CDS and intronic PIBS may, respectively and to a limited degree, inform disproportionately about older and younger splits. In this era of accessible whole-genome sequence data, these results 1) suggest benefits to more intentionally focusing on noncoding loci as robust data for tree inference and 2) reinforce the importance of accurate modeling, especially when using CDS data.
Affiliation(s)
- Robert Literman
  - Department of Biological Sciences, University of Rhode Island, South Kingstown, RI, USA
  - Center for Food Safety and Applied Nutrition, Office of Regulatory Science, U.S. Food and Drug Administration, College Park, MD, USA
- Rachel Schwartz
  - Department of Biological Sciences, University of Rhode Island, South Kingstown, RI, USA
6
Ramanathan N, Ramamurthy J, Natarajan G. Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison - A Review. Comb Chem High Throughput Screen 2021; 25:365-380. [PMID: 34382516 DOI: 10.2174/1386207324666210811101437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2020] [Revised: 06/16/2021] [Accepted: 06/24/2021] [Indexed: 11/22/2022]
Abstract
Background: Biological macromolecules, namely DNA, RNA, and protein, have their building blocks organized in a particular sequence, and the sequential arrangement encodes the evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among species. This is usually carried out by multiple sequence alignment (MSA). Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review is on alignment-free sequence comparison methods carried out using numerical characterization of DNA sequences. Discussion: The graphical representation of DNA sequences by chaos game representation and other 2-dimensional and 3-dimensional methods is discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis are presented. The extension of computing molecular descriptors in chemometrics to the calculation of a new set of DNA invariants and their use in alignment-free sequence comparison in an N-dimensional space and construction of phylogenetic trees is also reviewed. Conclusion: The phylogenetic trees constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHYLIP and ClustalW. One of the graphical representation methods is now extended to study viral sequences of infectious diseases for the identification of conserved regions to design peptide-based vaccines by combining numerical characterization and graphical representation.
Affiliation(s)
- Natarajan Ramanathan
  - Department of Chemistry, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu, India
- Jayalakshmi Ramamurthy
  - Department of Computer Science, Sri Sarada Niketan College for Women, Karur-639005, Tamil Nadu, India
- Ganapathy Natarajan
  - Department of Mechanical Engineering and Industrial Engineering, University of Wisconsin, Platteville, WI 53818, United States
7
Cloutier Barbour C, Vazquez K, Hammond E. Diagnosis and treatment of a poorly differentiated carcinoma in a male chimpanzee (Pan troglodytes)-A case study. J Med Primatol 2021; 50:219-221. [PMID: 34111311 DOI: 10.1111/jmp.12531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2020] [Revised: 04/05/2021] [Accepted: 05/17/2021] [Indexed: 11/26/2022]
Abstract
This study reports the occurrence of a poorly differentiated carcinoma in a captive-born 28-year-old male chimpanzee (Pan troglodytes) with a familial history of cancer. Pathological findings, surgical interventions, and experimental treatments are discussed.
8
Pei S, Yau SST. Analysis of the Genomic Distance Between Bat Coronavirus RaTG13 and SARS-CoV-2 Reveals Multiple Origins of COVID-19. Acta Mathematica Scientia = Shu Xue Wu Li Xue Bao 2021; 41:1017-1022. [PMID: 33897081 PMCID: PMC8054123 DOI: 10.1007/s10473-021-0323-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/25/2021] [Revised: 03/10/2021] [Indexed: 05/29/2023]
Abstract
The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, many COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze genomes of SARS-CoV-2 using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes by a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. From our analysis, SARS-CoV-2 most likely already existed in other countries such as France, India, the Netherlands, England and the United States before the outbreak in Wuhan, China.
Affiliation(s)
- Shaojun Pei
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 China
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 China
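For readers unfamiliar with the natural vector method named in the entry above, the following Python sketch computes the commonly described 12-dimensional natural vector of a nucleotide sequence: per-letter counts, mean positions, and normalized second central moments. The abstract does not define the k-mer variant or the "natural metric" used in the paper, so this covers only the single-nucleotide case, and the normalization shown is an assumption based on the usual published definition.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [n_A..n_T, mu_A..mu_T, D2_A..D2_T] for a sequence.
    n_k  : number of occurrences of letter k
    mu_k : mean 1-based position of letter k
    D2_k : sum((pos - mu_k)**2) / (n_k * N), with N the sequence length
           (assumed normalization)."""
    seq = seq.upper()
    total_len = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, c in enumerate(seq) if c == letter]
        n_k = len(positions)
        if n_k == 0:
            # Letter absent: all three components are set to zero.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * total_len)
        counts.append(n_k)
        means.append(mu)
        moments.append(d2)
    return counts + means + moments
```

Two genomes can then be compared by any vector distance between their natural vectors; which distance the paper uses is not stated in the abstract.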
9
Qi Z, Wen X. Novel Protein Sequence Comparison Method Based on Transition Probability Graph and Information Entropy. Comb Chem High Throughput Screen 2020; 25:392-400. [PMID: 32875978 DOI: 10.2174/1386207323666200901103001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/22/2022]
Abstract
Aim and objective: Sequence analysis is one of the foundations of bioinformatics. It is widely used to find the feature metrics hidden in a sequence. In addition, the graphical representation of biological sequences is an important tool for sequence analysis. This study is undertaken to find a new graphical representation of biosequences. Materials and methods: The transition probability is used to describe amino acid combinations of protein sequences. The combinations are composed of amino acids directly adjacent to each other or separated by multiple amino acids. The transition probability graph is built from the transition probabilities of amino acid combinations. Next, a map is defined from the transition probability graph to a transition probability vector by means of the k-order transition probability graph. Transition entropy vectors are developed from the transition probability vector and information entropy. Finally, the proposed method is applied to two separate applications: 499 HA genes of H1N1, and 95 coronaviruses. Results: By constructing a phylogenetic tree, we find that the results of each application are consistent with other studies. Conclusion: The graphical representation proposed in this article is a practical and correct method.
Affiliation(s)
- Zhaohui Qi
  - College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
- Xinlong Wen
  - College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
10
Yu Y, Yang J, Ma W, Pressel S, Liu H, Wu Y, Schneider H. Chloroplast phylogenomics of liverworts: a reappraisal of the backbone phylogeny of liverworts with emphasis on Ptilidiales. Cladistics 2019; 36:184-193. [DOI: 10.1111/cla.12396] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Accepted: 05/30/2019] [Indexed: 01/20/2023] Open
Affiliation(s)
- Ying Yu
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou 311121, China
- Jun‐Bo Yang
  - CAS Plant Germplasm and Genomics Center, Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- Wen‐Zhang Ma
  - CAS Key Laboratory for Plant Biodiversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- Silvia Pressel
  - Department of Life Sciences, Natural History Museum, London SW7 5BD, UK
- Hong‐Mei Liu
  - Key Laboratory of Tropical Plant Resources and Sustainable Use, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303, China
- Yu‐Huan Wu
  - College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou 311121, China
- Harald Schneider
  - Center of Integrative Conservation, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Menglun, Yunnan 666303, China
11
Zhao Y, Xue X, Xie X. An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison. Comput Biol Chem 2019; 80:10-15. [PMID: 30851619 DOI: 10.1016/j.compbiolchem.2019.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/08/2018] [Revised: 12/30/2018] [Accepted: 01/17/2019] [Indexed: 01/21/2023]
Abstract
Sequence comparison is an important topic in bioinformatics. With the exponential increase in biological sequences, the traditional alignment-based protein sequence comparison methods have become limited, so alignment-free methods have been widely proposed over the past two decades. In this paper, we considered not only six typical physicochemical properties of amino acids, but also their frequency and positional distribution. A 51-dimensional vector was obtained to describe the protein sequence. We obtained a pairwise distance matrix by computing the standardized Euclidean distance, from which discriminant analysis and phylogenetic analysis can be performed. The results on the Influenza A virus and ND5 datasets indicate that our method is accurate and efficient for classifying proteins and inferring the phylogeny of species.
Affiliation(s)
- Yunxiu Zhao
  - College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
- Xiaolong Xue
  - College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
- Xiaoli Xie
  - College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
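The pairwise distance step described in the entry above (standardized Euclidean distance between feature vectors) can be sketched as follows. The 51-dimensional feature construction itself is not specified in the abstract and is not reproduced here; the function name, the use of the sample variance, and the skipping of zero-variance components are assumptions for illustration.

```python
import math
import statistics


def standardized_euclidean_matrix(vectors):
    """Pairwise standardized Euclidean distances.
    Each squared component difference is divided by that component's
    sample variance across all input vectors; components with zero
    variance are skipped (assumed handling)."""
    columns = list(zip(*vectors))
    variances = [statistics.variance(col) for col in columns]
    m = len(vectors)
    dist = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = math.sqrt(
                sum(
                    (a - b) ** 2 / v
                    for a, b, v in zip(vectors[i], vectors[j], variances)
                    if v > 0
                )
            )
            dist[i][j] = dist[j][i] = d
    return dist
```

Standardizing by the per-component variance keeps features measured on large scales from dominating the distance, which matters when a vector mixes counts, positions, and property values.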
12
Yu X, Yang D, Guo C, Gao L. Plant phylogenomics based on genome-partitioning strategies: Progress and prospects. Plant Diversity 2018; 40:158-164. [PMID: 30740560 PMCID: PMC6137260 DOI: 10.1016/j.pld.2018.06.005] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 04/24/2018] [Revised: 06/26/2018] [Accepted: 06/27/2018] [Indexed: 05/26/2023]
Abstract
The rapid expansion of next-generation sequencing (NGS) has generated a powerful array of approaches to address fundamental questions in biology. Several genome-partitioning strategies to sequence selected subsets of the genome have emerged in the fields of phylogenomics and evolutionary genomics. In this review, we summarize the applications, advantages and limitations of four NGS-based genome-partitioning approaches in plant phylogenomics: genome skimming, transcriptome sequencing (RNA-seq), restriction site associated DNA sequencing (RAD-Seq), and targeted capture (Hyb-seq). Of these four genome-partitioning approaches, targeted capture (especially Hyb-seq) shows the greatest promise for plant phylogenetics over the next few years. This review will aid researchers in their selection of appropriate genome-partitioning approaches to address questions of evolutionary scale, where we anticipate continued development and expansion of whole-genome sequencing strategies in the fields of plant phylogenomics and evolutionary biology research.
Affiliation(s)
- Xiangqin Yu
  - Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
- Dan Yang
  - Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
  - Kunming College of Life Science, University of Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
- Cen Guo
  - Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
  - Kunming College of Life Science, University of Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
- Lianming Gao
  - Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
13
Yu X, Reva ON. SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees. Evol Bioinform Online 2018; 14:1176934318759299. [PMID: 29511354 PMCID: PMC5826093 DOI: 10.1177/1176934318759299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/14/2017] [Accepted: 01/24/2018] [Indexed: 11/17/2022] Open
Abstract
Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, inappropriateness of standard phylogenetic tools to process genome-wide data, and lack of reliable substitution models which correlate with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA.
Affiliation(s)
- Xiaoyu Yu
  - Department of Biochemistry, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
- Oleg N Reva
  - Department of Biochemistry, Centre for Bioinformatics and Computational Biology, University of Pretoria, Pretoria, South Africa
14
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 239] [Impact Index Per Article: 34.1] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
  - Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
15
Murray KD, Webers C, Ong CS, Borevitz J, Warthmann N. kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity. PLoS Comput Biol 2017; 13:e1005727. [PMID: 28873405 PMCID: PMC5600398 DOI: 10.1371/journal.pcbi.1005727] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 10/04/2016] [Revised: 09/15/2017] [Accepted: 08/21/2017] [Indexed: 11/18/2022] Open
Abstract
Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or “samples”) in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn using mislabelled, or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly-, and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes and applications include establishing sample identity and detecting mix-up, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
Affiliation(s)
- Kevin D. Murray
  - Research School of Biology, The Australian National University, Canberra, Australia
- Christfried Webers
  - Data61, CSIRO, Canberra, Australia
  - Research School of Computer Science, The Australian National University, Canberra, Australia
- Cheng Soon Ong
  - Data61, CSIRO, Canberra, Australia
  - Research School of Computer Science, The Australian National University, Canberra, Australia
- Justin Borevitz
  - Research School of Biology, The Australian National University, Canberra, Australia
- Norman Warthmann
  - Research School of Biology, The Australian National University, Canberra, Australia
16
Abstract
Fungi belong to one of the largest and most diverse kingdoms of living organisms. The evolutionary kinship within a fungal population has so far been inferred mostly from the gene-information-based trees ("gene trees"), constructed commonly based on the degree of differences of proteins or DNA sequences of a small number of highly conserved genes common among the population by a multiple sequence alignment (MSA) method. Since each gene evolves under different evolutionary pressure and time scale, it has been known that one gene tree for a population may differ from other gene trees for the same population depending on the subjective selection of the genes. Within the last decade, a large number of whole-genome sequences of fungi have become publicly available, which represent, at present, the most fundamental and complete information about each fungal organism. This presents an opportunity to infer kinship among fungi using a whole-genome information-based tree ("genome tree"). The method we used allows comparison of whole-genome information without MSA, and is a variation of a computational algorithm developed to find semantic similarities or plagiarism in two books, where we represent whole-genomic information of an organism as a book of words without spaces. The genome tree reveals several significant and notable differences from the gene trees, and these differences invoke new discussions about alternative narratives for the evolution of some of the currently accepted fungal groups.
Affiliation(s)
- JaeJin Choi
  - Department of Chemistry, University of California, Berkeley, CA 94720
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
  - Department of Integrated Omics for Biomedical Sciences, Yonsei University, Seoul 03722, Republic of Korea
  - Korea Research Institute of Bioscience and Biotechnology, Daejeon 34141, Republic of Korea
- Sung-Hou Kim
  - Department of Chemistry, University of California, Berkeley, CA 94720
  - Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
  - Department of Integrated Omics for Biomedical Sciences, Yonsei University, Seoul 03722, Republic of Korea
  - Center for Computational Biology, University of California, Berkeley, CA 94720
17
Seo H, Cho DH. A new alignment free genome comparison algorithm based on statistically estimated feature frequency profile. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2017; 2017:4265-4268. [PMID: 29060839 DOI: 10.1109/embc.2017.8037798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Sequence comparison is an important part of bioinformatics for understanding the biological properties of genomes. Although alignment-based sequence comparison is a traditional and reliable approach, alignment-free methods have been actively researched because of their advantage in computational complexity. In this paper, we suggest a new alignment-free genome comparison scheme based on a statistical approach. Word frequency information for the sequence is estimated from the sequence components. By investigating the relationship between the estimated frequency information and the actual word frequency, the characteristics of the sequence are represented numerically. A phylogenetic tree and a sequence classification of mammalian sequences are provided to demonstrate the performance of our statistical algorithm.
|
18
|
Biase FH. Oocyte Developmental Competence: Insights from Cross-Species Differential Gene Expression and Human Oocyte-Specific Functional Gene Networks. OMICS 2017; 21:156-168. [DOI: 10.1089/omi.2016.0177] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
|
19
|
Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer. Sci Rep 2017; 7:40712. [PMID: 28102365 PMCID: PMC5244389 DOI: 10.1038/srep40712] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2016] [Accepted: 12/08/2016] [Indexed: 11/25/2022] Open
Abstract
The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
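As a rough illustration of the third criterion named in this abstract, the Shannon diversity index of the k-mer features of a single sequence could be computed as sketched below. The function name `shannon_diversity`, the choice of the natural logarithm, and the toy sequence are assumptions made for illustration; the abstract does not state how the index is combined with the other two criteria to select k.

```python
import math
from collections import Counter


def shannon_diversity(seq, k):
    """Shannon diversity H = -sum(p * ln p) over the observed k-mers of seq.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    seq = seq.upper()
    # Count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0  # sequence shorter than k: no features to measure
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    toy = "ACGTACGTTTGACA"  # toy sequence, not real data
    for k in range(1, 5):
        print(k, round(shannon_diversity(toy, k), 4))
```

Scanning a range of k values in this way only shows how the diversity of features changes with feature length; any threshold for choosing k would have to come from the paper itself.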
|
20
|
Varki NM, Varki A. On the apparent rarity of epithelial cancers in captive chimpanzees. Philos Trans R Soc Lond B Biol Sci 2016; 370:rstb.2014.0225. [PMID: 26056369 DOI: 10.1098/rstb.2014.0225] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Malignant neoplasms arising from epithelial cells are called carcinomas. Such cancers are diagnosed in about one in three humans in 'developed' countries, with the most common sites affected being lung, breast, prostate, colon, ovary and pancreas. By contrast, carcinomas are said to be rare in captive chimpanzees, which share more than 99% protein sequence homology with humans (and possibly in other related 'great apes'-bonobos, gorillas and orangutans). Simple ascertainment bias is an unlikely explanation, as these nonhuman hominids are recipients of excellent veterinary care in research facilities and zoos, and are typically subjected to necropsies when they die. In keeping with this notion, benign tumours and cancers that are less common in humans are well documented in this population. In this brief overview, we discuss other possible explanations for the reported rarity of carcinomas in our closest evolutionary cousins, including inadequacy of numbers surveyed, differences in life expectancy, diet, genetic susceptibility, immune responses or their microbiomes, and other potential environmental factors. We conclude that while relative carcinoma risk is a likely difference between humans and chimpanzees (and possibly other 'great apes'), a more systematic survey of available data is required for validation of this claim.
Affiliation(s)
- Nissi M Varki
- Department of Pathology, Center for Academic Research and Training in Anthropogeny (CARTA), University of California, San Diego, La Jolla, CA 92093, USA
- Ajit Varki
- Department of Pathology, Center for Academic Research and Training in Anthropogeny (CARTA), University of California, San Diego, La Jolla, CA 92093, USA
- Department of Medicine, Center for Academic Research and Training in Anthropogeny (CARTA), University of California, San Diego, La Jolla, CA 92093, USA
- Department of Cellular and Molecular Medicine, Center for Academic Research and Training in Anthropogeny (CARTA), University of California, San Diego, La Jolla, CA 92093, USA
|
21
|
Paci G, Cristadoro G, Monti B, Lenci M, Degli Esposti M, Castellani GC, Remondini D. Characterization of DNA methylation as a function of biological complexity via dinucleotide inter-distances. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2016; 374:rsta.2015.0227. [PMID: 26857665 DOI: 10.1098/rsta.2015.0227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 11/23/2015] [Indexed: 06/05/2023]
Abstract
We perform a statistical study of the distances between successive occurrences of a given dinucleotide in the DNA sequence for a number of organisms of different complexity. Our analysis highlights peculiar features of the CG dinucleotide distribution in mammalian DNA, pointing towards a connection with the role of such dinucleotide in DNA methylation. While the CG distributions of mammals exhibit exponential tails with comparable parameters, the picture for the other organisms studied (e.g. fish, insects, bacteria and viruses) is more heterogeneous, possibly because in these organisms DNA methylation has different functional roles. Our analysis suggests that the distribution of the distances between CG dinucleotides provides useful insights into characterizing and classifying organisms in terms of methylation functionalities.
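The quantity studied in this abstract, the distance between successive occurrences of a given dinucleotide, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the use of start-to-start distances, and the toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter


def dinucleotide_interdistances(seq, dinuc="CG"):
    """Distances between the start positions of successive dinuc occurrences."""
    seq = seq.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_distribution(distances):
    """Empirical probability of each inter-distance value."""
    counts = Counter(distances)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


if __name__ == "__main__":
    toy = "CGACGTTCGCG"  # toy sequence, not real data
    dists = dinucleotide_interdistances(toy)
    print(dists)
    print(distance_distribution(dists))
```

On real genomes the tail of this empirical distribution is what one would inspect, for example on a log scale, to judge whether it decays exponentially.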
Affiliation(s)
- Giulia Paci
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Giampaolo Cristadoro
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Barbara Monti
- Department of Pharmacy and Biotechnology, University of Bologna, Via S. Donato 15, Bologna 40127, Italy
- Marco Lenci
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
- Mirko Degli Esposti
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Gastone C Castellani
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
- Daniel Remondini
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
|
22
|
Abstract
Fifty complete Bacillus genome sequences and associated plasmids were compared using the “feature frequency profile” (FFP) method. The resulting whole-genome phylogeny supports the placement of three Bacillus species (B. thuringiensis, B. anthracis and B. cereus) as a single clade. The monophyletic status of B. anthracis was strongly supported by the analysis. FFP proved to be more effective in inferring the phylogeny of Bacillus than analyses based on single gene sequences [16S rRNA gene, GyrB (gyrase subunit B) and AroE (shikimate-5-dehydrogenase)]. The findings of the FFP analysis were verified using kSNP v2 (an alignment-free sequence analysis method) and the Harvest suite (a core-genome sequence alignment method).
|
23
|
Wen J, Zhang Y, Yau SS. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. J Theor Biol 2014; 363:145-50. [DOI: 10.1016/j.jtbi.2014.08.028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2014] [Revised: 07/14/2014] [Accepted: 08/17/2014] [Indexed: 10/24/2022]
|
24
|
A novel k-word relative measure for sequence comparison. Comput Biol Chem 2014; 53PB:331-338. [PMID: 25462340 DOI: 10.1016/j.compbiolchem.2014.10.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2014] [Revised: 08/10/2014] [Accepted: 10/25/2014] [Indexed: 12/28/2022]
Abstract
In order to extract phylogenetic information from DNA sequences, a new normalized k-word average relative distance is proposed in this paper. The proposed measure was tested by discriminant analysis and phylogenetic analysis. The phylogenetic trees based on the Manhattan distance measure are reconstructed with k ranging from 1 to 12. In addition, a new method is suggested to reduce the matrix dimension, which can greatly reduce the amount of calculation and the running time. The experimental assessment demonstrated that our measure was efficient. Moreover, comparison with the results of other methods shows that our method is feasible and powerful for phylogenetic analysis.
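The general idea of comparing sequences through k-word frequencies under the Manhattan distance can be sketched as follows. This is only a generic illustration: it uses plain relative k-word frequencies, not the paper's normalized k-word average relative distance, whose definition is not given in the abstract. The function names and toy sequences are assumptions.

```python
from collections import Counter
from itertools import product


def kword_frequencies(seq, k):
    """Relative frequency of every possible DNA k-word, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only words over A, C, G, T are counted; others (e.g. with N) are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    s1, s2 = "AACCGGTTAACC", "AACCGGTTGGTT"  # toy sequences, not real data
    for k in (1, 2, 3):
        d = manhattan(kword_frequencies(s1, k), kword_frequencies(s2, k))
        print(k, round(d, 4))
```

A pairwise matrix of such distances is the usual input to a distance-based tree builder. The frequency vector has 4^k entries, which is the dimension growth that the abstract's reduction step is meant to address.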
|
25
|
Prabha R, Singh DP, Gupta SK, Rai A. Whole genome phylogeny of Prochlorococcus marinus group of cyanobacteria: genome alignment and overlapping gene approach. Interdiscip Sci 2014; 6:149-57. [PMID: 25172453 DOI: 10.1007/s12539-013-0024-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2013] [Revised: 10/21/2013] [Accepted: 01/10/2014] [Indexed: 11/29/2022]
Abstract
Prochlorococcus is the smallest known oxygenic phototrophic marine cyanobacterium dominating the mid-latitude oceans. Physiologically and genetically distinct P. marinus isolates from many oceans in the world were assigned to two different groups, a tightly clustered high-light (HL)-adapted clade and a divergent low-light (LL)-adapted clade. Phylogenetic analysis of this cyanobacterium on the basis of 16S rRNA and other conserved genes did not show consistency with its phenotypic behavior. We analyzed the phylogeny of this genus on the basis of complete genome sequences through genome alignment, overlapping-gene content and gene-order approaches. The phylogenetic tree of P. marinus obtained by comparing whole genome sequences, in contrast to that based on the 16S rRNA gene, corresponded well with the HL/LL ecotypic distinction of twelve strains and showed consistency with the phenotypic classification of P. marinus. Evidence for the horizontal descent and acquisition of genes within and across the genus was observed. Many genes involved in metabolic functions were found to be conserved across these genomes and many were continuously gained by different strains as per their needs during the course of their evolution. Consistency in the physiological and genetic phylogeny based on whole genome sequence is established. These observations improve our understanding of the adaptation and diversification of these organisms under evolutionary pressure.
Affiliation(s)
- Ratna Prabha
- National Bureau of Agriculturally Important Microorganisms, Indian Council of Agricultural Research, Kushmaur, Maunath Bhanjan, 275103, India
|
26
|
King BR, Aburdene M, Thompson A, Warres Z. Application of discrete Fourier inter-coefficient difference for assessing genetic sequence similarity. EURASIP J Bioinform Syst Biol 2014; 2014:8. [PMID: 24991213 PMCID: PMC4077688 DOI: 10.1186/1687-4153-2014-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/16/2013] [Accepted: 05/01/2014] [Indexed: 11/27/2022]
Abstract
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
Affiliation(s)
- Brian R King
- Department of Computer Science, Bucknell University, Lewisburg, PA 17837, USA
- Maurice Aburdene
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
- Alex Thompson
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
- Zach Warres
- Department of Electrical and Computer Engineering, Bucknell University, Lewisburg, PA 17837, USA
|
27
|
K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene 2014; 546:25-34. [PMID: 24858075 DOI: 10.1016/j.gene.2014.05.043] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2014] [Revised: 05/04/2014] [Accepted: 05/20/2014] [Indexed: 11/21/2022]
Abstract
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships, in terms of both accuracy and efficiency.
|
28
|
Abstract
The gene order on the X chromosome of eutherians is generally highly conserved, although an increase in the rate of rearrangement has been reported in the rodent lineage. Conservation of the X chromosome is thought to be caused by selection related to maintenance of dosage compensation. However, we herein reveal that the cattle (Btau4.0) lineage has experienced a strong increase in the rate of X-chromosome rearrangement, much stronger than that previously reported for rodents. We also show that this increase is not matched by a similar increase on the autosomes and cannot be explained by assembly errors. Furthermore, we compared the difference in two cattle genome assemblies: Btau4.0 and Btau6.0 (Bos taurus UMD3.1). The results showed a discrepancy between Btau4.0 and Btau6.0 cattle assembly version data, and we believe that Btau6.0 cattle assembly version data are not more reliable than Btau4.0. [BMB Reports 2013; 46(6): 310-315]
Affiliation(s)
- Woncheoul Park
- Department of Agricultural Biotechnology and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-742, Korea
|
29
|
Yu HJ. Segmented K-mer and its application on similarity analysis of mitochondrial genome sequences. Gene 2013; 518:419-24. [DOI: 10.1016/j.gene.2012.12.079] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2012] [Revised: 12/01/2012] [Accepted: 12/19/2012] [Indexed: 11/25/2022]
|
30
|
Comin M, Verzotto D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol 2012; 7:34. [PMID: 23216990 PMCID: PMC3549825 DOI: 10.1186/1748-7188-7-34] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2012] [Accepted: 11/29/2012] [Indexed: 11/24/2022] Open
Abstract
Background With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. Methods In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding counting shared subwords multiple times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. Results The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state-of-the-art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. Availability http://www.dei.unipd.it/~ciompin/main/underlying.html
|
31
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
|
32
|
Cohen E, Chor B. Detecting Phylogenetic Signals in Eukaryotic Whole Genome Sequences. J Comput Biol 2012; 19:945-56. [DOI: 10.1089/cmb.2012.0122] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Eyal Cohen
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Benny Chor
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
|
33
|
Brown JL, Knowles LL. Spatially explicit models of dynamic histories: examination of the genetic consequences of Pleistocene glaciation and recent climate change on the American Pika. Mol Ecol 2012; 21:3757-75. [DOI: 10.1111/j.1365-294x.2012.05640.x] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
34
|
Rubin BER, Ree RH, Moreau CS. Inferring phylogenies from RAD sequence data. PLoS One 2012; 7:e33394. [PMID: 22493668 PMCID: PMC3320897 DOI: 10.1371/journal.pone.0033394] [Citation(s) in RCA: 208] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2011] [Accepted: 02/14/2012] [Indexed: 11/24/2022] Open
Abstract
Reduced-representation genome sequencing represents a new source of data for systematics, and its potential utility in interspecific phylogeny reconstruction has not yet been explored. One approach that seems especially promising is the use of inexpensive short-read technologies (e.g., Illumina, SOLiD) to sequence restriction-site associated DNA (RAD), that is, the regions of the genome that flank the recognition sites of restriction enzymes. In this study, we simulated the collection of RAD sequences from sequenced genomes of different taxa (Drosophila, mammals, and yeasts) and developed a proof-of-concept workflow to test whether informative data could be extracted and used to accurately reconstruct "known" phylogenies of species within each group. The workflow consists of three basic steps: first, sequences are clustered by similarity to estimate orthology; second, clusters are filtered by taxonomic coverage; and third, they are aligned and concatenated for "total evidence" phylogenetic analysis. We evaluated the performance of clustering and filtering parameters by comparing the resulting topologies with well-supported reference trees and we were able to identify conditions under which the reference tree was inferred with high support. For Drosophila, whole genome alignments allowed us to directly evaluate which parameters most consistently recovered orthologous sequences. For the parameter ranges explored, we recovered the best results at the low ends of sequence similarity and taxonomic representation of loci; these generated the largest supermatrices with the highest proportion of missing data. Applications of the method to mammals and yeasts were less successful, which we suggest may be due partly to their much deeper evolutionary divergence times compared to Drosophila (crown ages of approximately 100 and 300 versus 60 Mya, respectively). RAD sequences thus appear to hold promise for reconstructing phylogenetic relationships in younger clades in which sufficient numbers of orthologous restriction sites are retained across species.
Affiliation(s)
- Benjamin E R Rubin
- Committee on Evolutionary Biology, University of Chicago, Chicago, Illinois, United States of America.
|
35
|
Hatje K, Kollmar M. A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method. Front Plant Sci 2012; 3:192. [PMID: 22952468 PMCID: PMC3429886 DOI: 10.3389/fpls.2012.00192] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/25/2012] [Accepted: 08/06/2012] [Indexed: 05/06/2023]
Abstract
Phylogenetic analyses reveal the evolutionary derivation of species. A phylogenetic tree can be inferred from multiple sequence alignments of proteins or genes. The alignment of whole genome sequences of higher eukaryotes is a computationally intensive and ambitious task, as is the computation of phylogenetic trees based on these alignments. To overcome these limitations, we here used an alignment-free method to compare genomes of the Brassicales clade. For each nucleotide sequence a Chaos Game Representation (CGR) can be computed, which represents each nucleotide of the sequence as a point in a square defined by the four nucleotides as vertices. Each CGR is therefore a unique fingerprint of the underlying sequence. If the CGRs are divided by grid lines, each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence (Frequency Chaos Game Representation, FCGR). Here, we used distance measures between FCGRs to infer phylogenetic trees of Brassicales species. Three types of data were analyzed because of their different characteristics: (A) Whole genome assemblies as far as available for species belonging to the Malvidae taxon. (B) EST data of species of the Brassicales clade. (C) Mitochondrial genomes of the Rosids branch, a supergroup of the Malvidae. The trees reconstructed based on the Euclidean distance method are in general agreement with single gene trees. The Fitch-Margoliash and Neighbor joining algorithms resulted in similar to identical trees. Here, for the first time we have applied the bootstrap re-sampling concept to trees based on FCGRs to determine the support of the branchings. FCGRs have the advantage that they are fast to calculate, and can be used as additional information to alignment based data and morphological characteristics to improve the phylogenetic classification of species in ambiguous cases.
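The CGR and FCGR constructions described in this abstract can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the corner assignment (A, C, G, T at (0,0), (0,1), (1,1), (1,0)), the starting point at the centre of the square, the function names, and the toy sequences are all assumptions. The FCGR cell of each k-mer is computed with integer arithmetic, which gives the same cell as binning the CGR point reached after the k-mer's last base on a 2^k by 2^k grid while avoiding floating-point drift on long sequences.

```python
import math

# Assumed corner assignment for the unit square (not specified in the abstract).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """CGR coordinates: each base moves the point halfway toward its corner."""
    x, y = 0.5, 0.5  # assumed starting point: centre of the square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError for symbols other than A, C, G, T
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Relative k-mer frequencies laid out on a 2^k x 2^k grid (rows are y, columns are x)."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if any(b not in CORNERS for b in word):
            continue  # skip k-mers containing ambiguous symbols such as N
        col = row = 0
        for i, b in enumerate(word):
            cx, cy = CORNERS[b]
            col |= cx << i  # the last base of the k-mer sets the most significant bit
            row |= cy << i
        grid[row][col] += 1
        total += 1
    return [[c / total if total else 0.0 for c in r] for r in grid]


def euclidean(f1, f2):
    """Euclidean distance between two FCGR grids of the same size."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(f1, f2)
                         for a, b in zip(r1, r2)))


if __name__ == "__main__":
    s1, s2 = "ACGTACGTAGCT", "ACGTTTGTAGCA"  # toy sequences, not real data
    print(cgr_points("AC"))
    print(euclidean(fcgr(s1, 2), fcgr(s2, 2)))
```

A matrix of such pairwise distances is what a distance-based method such as Neighbor joining or Fitch-Margoliash would take as input.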
Affiliation(s)
- Klas Hatje
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- Martin Kollmar
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- *Correspondence: Martin Kollmar, Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Am Fassberg 11, D-37077 Göttingen, Germany. e-mail:
|
36
|
Cheung MK, Li L, Nong W, Kwan HS. 2011 German Escherichia coli O104:H4 outbreak: whole-genome phylogeny without alignment. BMC Res Notes 2011; 4:533. [PMID: 22166159 DOI: 10.1186/1756-0500-4-533] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2011] [Accepted: 12/13/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A large-scale Escherichia coli O104:H4 outbreak occurred in Germany from May to July 2011, causing numerous cases of hemolytic-uremic syndrome (HUS) and deaths. Genomes of ten outbreak isolates and a historical O104:H4 strain isolated in 2001 were sequenced using different new generation sequencing platforms. Phylogenetic analyses were performed using various approaches which either are not genome-wide or may be subject to errors due to poor sequence alignment. Also, detailed pathogenicity analyses on the 2001 strain were not available. FINDINGS We reconstructed the phylogeny of E. coli using the genome-wide and alignment-free feature frequency profile method and revealed the 2001 strain to be the closest relative to the 2011 outbreak strain among all available E. coli strains at present and confirmed findings from previous alignment-based phylogenetic studies that the HUS-causing O104:H4 strains are more closely related to typical enteroaggregative E. coli (EAEC) than to enterohemorrhagic E. coli. Detailed re-examination of pathogenicity-related virulence factors and secreted proteins showed that the 2001 strain possesses virulence factors shared between typical EAEC and the 2011 outbreak strain. CONCLUSIONS Our study represents the first attempt to elucidate the whole-genome phylogeny of the 2011 German outbreak using an alignment-free method, and suggested a direct line of ancestry leading from a putative EAEC-like ancestor through the 2001 strain to the 2011 outbreak strain.
Affiliation(s)
- Man Kit Cheung
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China.
|
37
|
Devillers H, Schbath S. Separating significant matches from spurious matches in DNA sequences. J Comput Biol 2011; 19:1-12. [PMID: 22149632 DOI: 10.1089/cmb.2011.0070] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Word matches are widely used to compare genomic sequences. Complete genome alignment methods often rely on the use of matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among matches that are retrieved from the comparison of two genomic sequences, a part of them may correspond to spurious matches (SMs), which are matches obtained by chance rather than by homologous relationships. The number of SMs depends on the minimal match length (ℓ) that has to be set in the algorithm used to retrieve them. Indeed, if ℓ is too small, a lot of matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved but many smaller significant matches are certainly ignored. To date, the choice of ℓ mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric distributions to characterize the distribution of the length of matches obtained from the comparison of two genomic sequences.
Affiliation(s)
- Hugo Devillers
- INRA, UR1077, Mathématique, Informatique, et Génome, Jouy-en-Josas, France.
|
38
|
Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol 2011; 62:756-63. [PMID: 22155711 DOI: 10.1016/j.ympev.2011.11.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2010] [Revised: 10/29/2011] [Accepted: 11/18/2011] [Indexed: 10/14/2022]
Abstract
Pathogens like HIV-1, which evolve into many closely related variants displaying differential infectivity and evolutionary dynamics in a short time scale, require fast and accurate classification. Conventional whole genome sequence alignment-based methods are computationally expensive and involve complex analysis. Alignment-free methodologies are increasingly being used to effectively differentiate genomic variations between viral species. Multifractal analysis, which explores the self-similar nature of genomes, is an alignment-free methodology that has been applied to study such variations. However, whether multifractal analysis can quantify variations between closely related genomes, such as the HIV-1 subtypes, is an open question. Here we address the above by implementing the multifractal analysis on four retroviral genomes (HIV-1, HIV-2, SIVcpz, and HTLV-1), and demonstrate that individual multifractal properties can differentiate between different retrovirus types easily. However, the individual multifractal measures do not resolve within-group variations for different known subtypes of HIV-1 M group. We show here that these known subtypes can instead be classified correctly using a combination of the crucial multifractal measures. This method is simple and computationally fast in comparison to the conventional alignment-based methods for whole genome phylogenetic analysis.
Affiliation(s)
- Aridaman Pandit
- Mathematical Modeling and Computational Biology Group, Centre for Cellular and Molecular Biology (CSIR), Hyderabad 500007, India
|
39
|
Tetushkin EY. Genetic aspects of genealogy. RUSS J GENET+ 2011. [DOI: 10.1134/s1022795411110160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
40
|
Sims GE, Kim SH. Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc Natl Acad Sci U S A 2011; 108:8329-34. [PMID: 21536867 PMCID: PMC3100984 DOI: 10.1073/pnas.1105168108] [Citation(s) in RCA: 101] [Impact Index Per Article: 7.8] [Indexed: 11/18/2022] Open
Abstract
A whole-genome phylogeny of the Escherichia coli/Shigella group was constructed by using the feature frequency profile (FFP) method. This alignment-free approach uses the frequencies of l-mer features of whole genomes to infer phylogenic distances. We present two phylogenies that accentuate different aspects of E. coli/Shigella genomic evolution: (i) one based on the compositions of all possible features of length l = 24 (∼8.4 million features), which are likely to reveal the phenetic grouping and relationship among the organisms and (ii) the other based on the compositions of core features with low frequency and low variability (∼0.56 million features), which account for ∼69% of all commonly shared features among 38 taxa examined and are likely to have genome-wide lineal evolutionary signal. Shigella appears as a single clade when all possible features are used without filtering of noncore features. However, results using core features show that Shigella consists of at least two distantly related subclades, implying that the subclades evolved into a single clade because of a high degree of convergence influenced by mobile genetic elements and niche adaptation. In both FFP trees, the basal group of the E. coli/Shigella phylogeny is the B2 phylogroup, which contains primarily uropathogenic strains, suggesting that the E. coli/Shigella ancestor was likely a facultative or opportunistic pathogen. The extant commensal strains diverged relatively late and appear to be the result of reductive evolution of genomes. We also identify clade distinguishing features and their associated genomic regions within each phylogroup. Such features may provide useful information for understanding evolution of the groups and for quick diagnostic identification of each phylogroup.
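The idea of an l-mer feature frequency profile can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `ffp` and `js_divergence` are hypothetical, the choice of Jensen-Shannon divergence as the distance is an assumption not stated in the abstract, and the core-feature filtering described above is omitted.

```python
from collections import Counter
from math import log2


def ffp(seq, l):
    """Relative frequencies of all overlapping l-mers in seq (hypothetical helper)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than l: no features to count.
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles.

    Assumption: this distance is chosen for illustration only.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pk, qk = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            div += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * log2(qk / mk)
    return div


if __name__ == "__main__":
    a = ffp("ACGTACGTTGCA", 3)
    b = ffp("ACGTTCGTAGCA", 3)
    print(js_divergence(a, b))
```

A pairwise matrix of such divergences could then be passed to any distance-based tree builder; with l = 24, as in the abstract, profiles of real genomes become very sparse, which is why a dictionary is used here rather than a dense vector.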
Affiliation(s)
- Gregory E. Sims
- Department of Informatics, J. Craig Venter Institute, Rockville, MD 20850
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
- Sung-Hou Kim
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
- Department of Chemistry, University of California, Berkeley CA 94720-1460; and
- Department of Integrated OMICS for Biomedical Sciences, Graduate School, Yonsei University, Seoul 120-749, Republic of Korea
|
41
|
Alignment-free comparison of genome sequences by a new numerical characterization. J Theor Biol 2011; 281:107-12. [PMID: 21536050 DOI: 10.1016/j.jtbi.2011.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 12/04/2010] [Revised: 04/01/2011] [Accepted: 04/02/2011] [Indexed: 01/29/2023]
Abstract
In order to compare different genome sequences, an alignment-free method has been proposed. First, we presented a new graphical representation of DNA sequences without degeneracy, which is conducive to intuitive comparison of sequences. Then, a new numerical characterization based on the representation was introduced to quantitatively depict the intrinsic nature of genome sequences, and was treated as a 10-dimensional vector in the mathematical space. Alignment-free comparison of sequences was performed by computing the distances between vectors of the corresponding numerical characterizations, which define the evolutionary relationship. Two data sets of DNA sequences were constructed to assess the performance on sequence comparison. The results illustrate well the validity of the method. The new numerical characterization provides a powerful tool for genome comparison.
|
42
|
Fraser MO. New Insights into the Pathophysiology of Detrusor-Sphincter Dyssynergia. CURRENT BLADDER DYSFUNCTION REPORTS 2011. [DOI: 10.1007/s11884-011-0083-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022]
|
43
|
Afreixo V, Bastos CAC, Pinho AJ, Garcia SP, Ferreira PJSG. Genome analysis with distance to the nearest dissimilar nucleotide. J Theor Biol 2011; 275:52-8. [PMID: 21295040 DOI: 10.1016/j.jtbi.2011.01.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/06/2010] [Revised: 01/24/2011] [Accepted: 01/24/2011] [Indexed: 11/16/2022]
Abstract
DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species. Recently, we have proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance of a nucleotide to the first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar nucleotide and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute useful information about the evolution of the species.
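The distance defined in this abstract can be sketched in a few lines of Python. This is an illustrative reading of the definition, not the authors' code: the function name `nearest_dissimilar_distances` is hypothetical, the scan direction (forward along the sequence) is an assumption, and positions in the final run, which have no following dissimilar nucleotide, are reported as `None` because the abstract does not say how the boundary is handled.

```python
def nearest_dissimilar_distances(seq):
    """For each position, distance to the next position holding a different symbol.

    Assumptions: forward scan only; positions in the trailing run get None.
    """
    n = len(seq)
    dist = [None] * n
    # Scan right to left so each position can reuse its right neighbour's result.
    for i in range(n - 2, -1, -1):
        if seq[i] != seq[i + 1]:
            dist[i] = 1
        elif dist[i + 1] is not None:
            dist[i] = dist[i + 1] + 1
    return dist


if __name__ == "__main__":
    print(nearest_dissimilar_distances("AACGGGT"))
```

For `"AACGGGT"` the result is `[2, 1, 1, 3, 2, 1, None]`: longer homopolymer runs produce larger distances at their start, which is how the measure reflects the repetition structure of single nucleotides.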
Affiliation(s)
- Vera Afreixo
- Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal.
|
44
|
Pacheco MA, Battistuzzi FU, Lentino M, Aguilar RF, Kumar S, Escalante AA. Evolution of modern birds revealed by mitogenomics: timing the radiation and origin of major orders. Mol Biol Evol 2011; 28:1927-42. [PMID: 21242529 DOI: 10.1093/molbev/msr014] [Citation(s) in RCA: 149] [Impact Index Per Article: 11.5] [Indexed: 11/13/2022] Open
Abstract
Mitochondrial (mt) genes and genomes are among the major sources of data for evolutionary studies in birds. This places mitogenomic studies in birds at the core of intense debates in avian evolutionary biology. Indeed, complete mt genomes are actively being used to unveil the phylogenetic relationships among major orders, whereas single genes (e.g., cytochrome c oxidase I [COX1]) are considered standard for species identification and defining species boundaries (DNA barcoding). In this investigation, we study the time of origin and evolutionary relationships among Neoaves orders using complete mt genomes. First, we were able to solve polytomies previously observed at the deep nodes of the Neoaves phylogeny by analyzing 80 mt genomes, including 17 new sequences reported in this investigation. As an example, we found evidence indicating that columbiforms and charadriforms are sister groups. Overall, our analyses indicate that by improving the taxonomic sampling, complete mt genomes can solve the evolutionary relationships among major bird groups. Second, we used our phylogenetic hypotheses to estimate the time of origin of major avian orders as a way to test if their diversification took place prior to the Cretaceous/Tertiary (K/T) boundary. Such timetrees were estimated using several molecular dating approaches and conservative calibration points. Whereas we found time estimates slightly younger than those reported by others, most of the major orders originated prior to the K/T boundary. Finally, we used our timetrees to estimate the rate of evolution of each mt gene. We found great variation in the mutation rates among mt genes and within different bird groups. COX1 was the gene with the least variation among Neoaves orders and the one with the least amount of rate heterogeneity across lineages. Such findings support the choice of COX1 among mt genes as a target for developing DNA barcoding approaches in birds.
Affiliation(s)
- M Andreína Pacheco
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, AZ, USA
|
45
|
Tao W, Zou M, Wang X, Gan X, Mayden RL, He S. Phylogenomic analysis resolves the formerly intractable adaptive diversification of the endemic clade of east Asian Cyprinidae (Cypriniformes). PLoS One 2010; 5:e13508. [PMID: 20976012 PMCID: PMC2958143 DOI: 10.1371/journal.pone.0013508] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/12/2010] [Accepted: 09/21/2010] [Indexed: 11/19/2022] Open
Abstract
Despite their great diversity and biological importance, evolutionary relationships among the endemic clade of East Asian Cyprinidae remain ambiguous. Understanding the phylogenetic history of this group involves many challenges. For instance, ecomorphological convergence may confound morphology-based phylogenetic inferences, and previous molecular phylogenetic studies based on single genes have often yielded contradictory and poorly supported trees. We assembled a comprehensive data matrix of 100 nuclear gene segments (∼71,132 base pairs) for representative species of the endemic East Asian cyprinid fauna and recovered a robust phylogeny from this genome-wide signal supported by multiple analytical methods, including maximum parsimony, maximum likelihood and Bayesian inference. Relaxed molecular clock analyses indicated that species radiations of this clade were concentrated at approximately 1.9–7.6 MYA. We provide evidence that the bursts of diversification in this fauna are directly linked to major paleoenvironmental events associated with monsoon evolution occurring from the late Miocene to the Pliocene. Ancestral state reconstruction reveals convergent morphological characters, which are hypothesized to be independent products of similar selective pressures in ecosystems. Our study is the first comprehensive phylogenetic study of the enigmatic East Asian cyprinids. The explicit molecular phylogeny provides a valuable framework for future research in genome evolution, adaptation and speciation of cyprinids.
Affiliation(s)
- Wenjing Tao
- Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, People's Republic of China
- Graduate University of Chinese Academy of Sciences, Beijing, People's Republic of China
- Ming Zou
- Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, People's Republic of China
- Graduate University of Chinese Academy of Sciences, Beijing, People's Republic of China
- Xuzhen Wang
- Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, People's Republic of China
- Graduate University of Chinese Academy of Sciences, Beijing, People's Republic of China
- Xiaoni Gan
- Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, People's Republic of China
- Graduate University of Chinese Academy of Sciences, Beijing, People's Republic of China
- Richard L. Mayden
- Laboratory of Integrated Genomics, Biodiversity, and Conservation, Department of Biology, Saint Louis University, Saint Louis, Missouri, United States of America
- Shunping He
- Key Laboratory of Aquatic Biodiversity and Conservation of Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, People's Republic of China
|
46
|
Ma X, Wang Z, Zhang X. Evolution of dopamine-related systems: biosynthesis, degradation and receptors. J Mol Evol 2010; 71:374-84. [PMID: 20890594 DOI: 10.1007/s00239-010-9392-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/27/2010] [Accepted: 09/13/2010] [Indexed: 10/19/2022]
Abstract
The evolution of enzyme genes at the pathway level has attracted increasing attention in recent years. Most investigations have focused on microorganisms, plants and invertebrates but rarely on vertebrates. The dopamine pathway, which participates in almost every aspect of brain function, is an excellent candidate for study at the pathway level. Herein, we report data on the divergence of six dopamine metabolic enzyme genes (three anabolic, three catabolic enzymes) and five dopamine receptor genes across five mammals, namely Homo sapiens, Pan troglodytes, Macaca mulatta, Mus musculus, and Rattus norvegicus. For enzyme genes, our data confirm the previous conclusion that upstream genes have evolved more slowly than downstream genes. Moreover, we found that catabolic genes in the dopamine metabolic pathway have evolved faster than anabolic genes, and maximum likelihood analysis suggested that this difference in evolutionary rate may be explained by anabolic genes being more constrained during selection. For dopamine receptor genes, however, the broadly expressed genes have tended to evolve more slowly than the narrowly expressed genes; maximum likelihood analysis showed that the relatively rapid evolutionary rate of the narrowly expressed receptor genes was a consequence of relaxed selective constraints. Finally, our data imply that selective constraints on synonymous sites in enzyme genes are relaxed compared with those of receptor genes because of differences in their patterns of functional regulation.
Affiliation(s)
- Xianghui Ma
- Department of Biochemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072, People's Republic of China.
|