Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019;20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open

Cited by Other Article(s)

Identification of HIV Rapid Mutations Using Differences in Nucleotide Distribution over Time. Genes (Basel) 2022;13:genes13020170. [PMID: 35205215 PMCID: PMC8872422 DOI: 10.3390/genes13020170] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2021] [Revised: 01/08/2022] [Accepted: 01/12/2022] [Indexed: 02/07/2023] Open

Jamdade R, Al-Shaer K, Al-Sallani M, Al-Harthi E, Mahmoud T, Gairola S, Shabana HA. Multilocus marker-based delimitation of Salicornia persica and its population discrimination assisted by supervised machine learning approach. PLoS One 2022;17:e0270463. [PMID: 35895732 PMCID: PMC9328517 DOI: 10.1371/journal.pone.0270463] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2021] [Accepted: 06/10/2022] [Indexed: 11/18/2022] Open

Abstract

The Salicornia L. has been considered one of the most taxonomically challenging genera due to high morphological plasticity, intergradation between related species, and lack of diagnostic features in preserved herbarium specimens. In the United Arab Emirates (UAE), only one species of this genus, Salicornia europaea, has been reported, though investigating its identity at the molecular level has not yet been undertaken. Moreover, based on growth form and morphology variation between the Ras-Al-Khaimah (RAK) population and the Umm-Al-Quwain (UAQ) population, we suspect the presence of different species or morphotypes. The present study aimed to initially perform species identification using multilocus DNA barcode markers from chloroplast DNA (cpDNA) and nuclear ribosomal DNA (nrDNA), followed by the genetic divergence between two populations (RAK and UAQ) belonging to two different coastal localities in the UAE. The analysis resulted in high-quality multilocus barcode sequences subjected to species discrimination through the unsupervised OTU picking and supervised learning methods. The ETS sequence data from our study sites had high identity with the previously reported sequences of Salicornia persica using NCBI blast and was further confirmed using OTU picking methods viz., TaxonDNAs Species identifier and Assemble Species by Automatic Partitioning (ASAP). Moreover, matK sequence data showed a non-monophyletic relationship, and significant discrimination between the two populations through alignment-based unsupervised OTU picking, alignment-free Co-Phylog, and alignment & alignment-free supervised learning approaches. Other markers viz., rbcL, trnH-psbA, ITS2, and ETS could not distinguish the two populations individually, though their combination with matK (cpDNA & cpDNA+nrDNA) showed enough population discrimination. However, the ITS2+ETS (nrDNA) exhibited much higher genetic divergence, further splitting both the populations into four haplotypes. 
Based on the observed morphology, genetic divergence, and the number of haplotypes predicted using the matK marker, it can be suggested that two distinct populations (RAK and UAQ) do exist. Further extensive morpho-taxonomic studies are required to determine the inter-population variability of Salicornia in the UAE. Altogether, our results suggest that S. persica is the species that grows in the present study area in the UAE, and do not support previous treatments as S. europaea.

He L, Sun S, Zhang Q, Bao X, Li PK. Alignment-free sequence comparison for virus genomes based on location correlation coefficient. Infect Genet Evol 2021;96:105106. [PMID: 34626822 PMCID: PMC8493760 DOI: 10.1016/j.meegid.2021.105106] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/20/2021] [Revised: 09/08/2021] [Accepted: 10/03/2021] [Indexed: 12/18/2022]

Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021;16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open

Ruohan W, Xianglilan Z, Jianping W, Shuai Cheng LI. DeepHost: phage host prediction with convolutional neural network. Brief Bioinform 2021;23:6374063. [PMID: 34553750 DOI: 10.1093/bib/bbab385] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2021] [Revised: 08/10/2021] [Accepted: 08/27/2021] [Indexed: 01/21/2023] Open

VanWallendael A, Alvarez M. Alignment-free methods for polyploid genomes: Quick and reliable genetic distance estimation. Mol Ecol Resour 2021;22:612-622. [PMID: 34478242 DOI: 10.1111/1755-0998.13499] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2020] [Accepted: 08/20/2021] [Indexed: 01/10/2023]

Andreace F, Pizzi C, Comin M. MetaProb 2: Metagenomic Reads Binning Based on Assembly Using Minimizers and K-Mers Statistics. J Comput Biol 2021;28:1052-1062. [PMID: 34448593 DOI: 10.1089/cmb.2021.0270] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open

Rachtman E, Bafna V, Mirarab S. CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genom Bioinform 2021;3:lqab071. [PMID: 34377979 PMCID: PMC8340999 DOI: 10.1093/nargab/lqab071] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2021] [Revised: 06/30/2021] [Accepted: 07/19/2021] [Indexed: 12/27/2022] Open

Pratas D, Silva JM. Persistent minimal sequences of SARS-CoV-2. Bioinformatics 2021;36:5129-5132. [PMID: 32730589 PMCID: PMC7559010 DOI: 10.1093/bioinformatics/btaa686] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2020] [Revised: 07/14/2020] [Accepted: 07/22/2020] [Indexed: 12/22/2022] Open

Sun Q, Peng Y, Liu J. A reference-free approach for cell type classification with scRNA-seq. iScience 2021;24:102855. [PMID: 34381979 PMCID: PMC8335627 DOI: 10.1016/j.isci.2021.102855] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2021] [Revised: 05/07/2021] [Accepted: 07/08/2021] [Indexed: 11/29/2022] Open

Akon M, Akon M, Kabir M, Rahman MS, Rahman MS. ADACT: a tool for analysing (dis)similarity among nucleotide and protein sequences using minimal and relative absent words. Bioinformatics 2021;37:1468-1470. [PMID: 33016997 DOI: 10.1093/bioinformatics/btaa853] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2019] [Revised: 09/09/2020] [Accepted: 09/21/2020] [Indexed: 11/14/2022] Open

Liang KYH, Orata FD, Boucher YF, Case RJ. Roseobacters in a Sea of Poly- and Paraphyly: Whole Genome-Based Taxonomy of the Family Rhodobacteraceae and the Proposal for the Split of the "Roseobacter Clade" Into a Novel Family, Roseobacteraceae fam. nov. Front Microbiol 2021;12:683109. [PMID: 34248901 PMCID: PMC8267831 DOI: 10.3389/fmicb.2021.683109] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2021] [Accepted: 05/27/2021] [Indexed: 11/13/2022] Open

Abstract

The family Rhodobacteraceae consists of alphaproteobacteria that are metabolically, phenotypically, and ecologically diverse. It includes the roseobacter clade, an informal designation, representing one of the most abundant groups of marine bacteria. The rapid pace of discovery of novel roseobacters in the last three decades meant that the best practice for taxonomic classification, a polyphasic approach utilizing phenotypic, genotypic, and phylogenetic characteristics, was not always followed. Early efforts for classification relied heavily on 16S rRNA gene sequence similarity and resulted in numerous taxonomic inconsistencies, with several poly- and paraphyletic genera within this family. Next-generation sequencing technologies have allowed whole-genome sequences to be obtained for most type strains, making a revision of their taxonomy possible. In this study, we performed whole-genome phylogenetic and genotypic analyses combined with a meta-analysis of phenotypic data to review taxonomic classifications of 331 type strains (under 119 genera) within the Rhodobacteraceae family. Representatives of the roseobacter clade not only have different environmental adaptations from other Rhodobacteraceae isolates but were also found to be distinct based on genomic, phylogenetic, and in silico-predicted phenotypic data. As such, we propose to move this group of bacteria into a new family, Roseobacteraceae fam. nov. In total, reclassifications resulted in 327 species and 128 genera, suggesting that misidentification is more problematic at the genus than species level. By resolving taxonomic inconsistencies of type strains within this family, we have established a set of coherent criteria based on whole-genome-based analyses that will help guide future taxonomic efforts and prevent the propagation of errors.

CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool based on Composition Vectors of Genomes. Genomics Proteomics Bioinformatics 2021;19:662-667. [PMID: 34119695 PMCID: PMC9040009 DOI: 10.1016/j.gpb.2021.03.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/22/2020] [Revised: 02/23/2021] [Accepted: 03/06/2021] [Indexed: 11/21/2022]

Ni H, Mu H, Qi D. Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses. J Mol Graph Model 2021;107:107942. [PMID: 34058640 DOI: 10.1016/j.jmgm.2021.107942] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2021] [Revised: 04/16/2021] [Accepted: 05/10/2021] [Indexed: 11/28/2022]

Lee B, Smith DK, Guan Y. Alignment free sequence comparison methods and reservoir host prediction. Bioinformatics 2021;37:3337-3342. [PMID: 33964132 PMCID: PMC8135978 DOI: 10.1093/bioinformatics/btab338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2021] [Revised: 03/29/2021] [Accepted: 04/30/2021] [Indexed: 11/19/2022] Open

Lu YY, Bai J, Wang Y, Wang Y, Sun F. CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase. Bioinformatics 2021;37:155-161. [PMID: 32766810 DOI: 10.1093/bioinformatics/btaa699] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Revised: 03/11/2020] [Accepted: 07/28/2020] [Indexed: 01/02/2023] Open

Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021;34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]

Jacobus AP, Stephens TG, Youssef P, González-Pech R, Ciccotosto-Camp MM, Dougan KE, Chen Y, Basso LC, Frazzon J, Chan CX, Gross J. Comparative Genomics Supports That Brazilian Bioethanol Saccharomyces cerevisiae Comprise a Unified Group of Domesticated Strains Related to Cachaça Spirit Yeasts. Front Microbiol 2021;12:644089. [PMID: 33936002 PMCID: PMC8082247 DOI: 10.3389/fmicb.2021.644089] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2020] [Accepted: 03/08/2021] [Indexed: 01/05/2023] Open

Abstract

Ethanol production from sugarcane is a key renewable fuel industry in Brazil. Major drivers of this alcoholic fermentation are Saccharomyces cerevisiae strains that originally were contaminants to the system and yet prevail in the industrial process. Here we present newly sequenced genomes (using Illumina short-read and PacBio long-read data) of two monosporic isolates (H3 and H4) of the S. cerevisiae PE-2, a predominant bioethanol strain in Brazil. The assembled genomes of H3 and H4, together with 42 draft genomes of sugarcane-fermenting (fuel ethanol plus cachaça) strains, were compared against those of the reference S288C and diverse S. cerevisiae. All genomes of bioethanol yeasts have amplified SNO2(3)/SNZ2(3) gene clusters for vitamin B1/B6 biosynthesis, and display ubiquitous presence of a particular family of SAM-dependent methyl transferases, rare in S. cerevisiae. Widespread amplifications of quinone oxidoreductases YCR102C/YLR460C/YNL134C, and the structural or punctual variations among aquaporins and components of the iron homeostasis system, likely represent adaptations to industrial fermentation. Interesting is the pervasive presence among the bioethanol/cachaça strains of a five-gene cluster (Region B) that is a known phylogenetic signature of European wine yeasts. Combining genomes of H3, H4, and 195 yeast strains, we comprehensively assessed whole-genome phylogeny of these taxa using an alignment-free approach. The 197-genome phylogeny substantiates that bioethanol yeasts are monophyletic and closely related to the cachaça and wine strains. Our results support the hypothesis that biofuel-producing yeasts in Brazil may have been co-opted from a pool of yeasts that were pre-adapted to alcoholic fermentation of sugarcane for the distillation of cachaça spirit, which historically is a much older industry than the large-scale fuel ethanol production.

Affiliation(s)

Ana Paula Jacobus: Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil
Timothy G Stephens: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Pierre Youssef: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Raul González-Pech: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Michael M Ciccotosto-Camp: Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
Katherine E Dougan: Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
Yibi Chen: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
Luiz Carlos Basso: Biological Science Department, Escola Superior de Agricultura Luiz de Queiroz, University of São Paulo (USP), Piracicaba, Brazil
Jeverson Frazzon: Institute of Food Science and Technology, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
Cheong Xin Chan: Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
Jeferson Gross: Laboratory for Genomics and Experimental Evolution of Yeasts, Institute for Bioenergy Research, São Paulo State University, Rio Claro, Brazil

Jonkheer EM, Brankovics B, Houwers IM, van der Wolf JM, Bonants PJM, Vreeburg RAM, Bollema R, de Haan JR, Berke L, Smit S, de Ridder D, van der Lee TAJ. The Pectobacterium pangenome, with a focus on Pectobacterium brasiliense, shows a robust core and extensive exchange of genes from a shared gene pool. BMC Genomics 2021;22:265. [PMID: 33849459 PMCID: PMC8045196 DOI: 10.1186/s12864-021-07583-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Accepted: 03/26/2021] [Indexed: 01/12/2023] Open

Abstract

BACKGROUND

Bacterial plant pathogens of the Pectobacterium genus are responsible for a wide spectrum of diseases in plants, including important crops such as potato, tomato, lettuce, and banana. Investigation of the genetic diversity underlying virulence and host specificity can be performed at genome level by using a comprehensive comparative approach called pangenomics. A pangenomic approach, using newly developed functionalities in PanTools, was applied to analyze the complex phylogeny of the Pectobacterium genus. We specifically used the pangenome to investigate genetic differences between virulent and avirulent strains of P. brasiliense, a potato blackleg causing species dominantly present in Western Europe.

RESULTS

Here we generated a multilevel pangenome for Pectobacterium, comprising 197 strains across 19 species, including type strains, with a focus on P. brasiliense. The extensive phylogenetic analysis of the Pectobacterium genus showed robust distinct clades, with most detail provided by 452,388 parsimony-informative single-nucleotide polymorphisms identified in single-copy orthologs. The average Pectobacterium genome consists of 47% core genes, 1% unique genes, and 52% accessory genes. Using the pangenome, we zoomed in on differences between virulent and avirulent P. brasiliense strains and identified 86 genes associated to virulent strains. We found that the organization of genes is highly structured and linked with gene conservation, function, and transcriptional orientation.

CONCLUSION

The pangenome analysis demonstrates that evolution in Pectobacteria is a highly dynamic process, including gene acquisitions partly in clusters, genome rearrangements, and loss of genes. Pectobacterium species are typically not characterized by a set of species-specific genes, but instead present themselves using new gene combinations from the shared gene pool. A multilevel pangenomic approach, fusing DNA, protein, biological function, taxonomic group, and phenotypes, facilitates studies in a flexible taxonomic context.

Sequence Comparison Without Alignment: The SpaM Approaches. Methods Mol Biol 2021;2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]

Du N, Shang J, Sun Y. Improving protein domain classification for third-generation sequencing reads using deep learning. BMC Genomics 2021;22:251. [PMID: 33836667 PMCID: PMC8033682 DOI: 10.1186/s12864-021-07468-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2020] [Accepted: 02/19/2021] [Indexed: 12/21/2022] Open

Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021;22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open

Rossier V, Warwick Vesztrocy A, Robinson-Rechavi M, Dessimoz C. OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches. Bioinformatics 2021;37:2866-2873. [PMID: 33787851 PMCID: PMC8479680 DOI: 10.1093/bioinformatics/btab219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2020] [Revised: 02/18/2021] [Accepted: 03/30/2021] [Indexed: 02/02/2023] Open

Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform 2021;3:lqab001. [PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2019] [Revised: 12/07/2020] [Accepted: 01/08/2021] [Indexed: 11/12/2022] Open

Chakraborty A, Morgenstern B, Bandyopadhyay S. S-conLSH: alignment-free gapped mapping of noisy long reads. BMC Bioinformatics 2021;22:64. [PMID: 33573603 PMCID: PMC7879691 DOI: 10.1186/s12859-020-03918-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2020] [Accepted: 12/02/2020] [Indexed: 11/16/2022] Open

Petrillo UF, Palini F, Cattaneo G, Giancarlo R. Alignment-free Genomic Analysis via a Big Data Spark Platform. Bioinformatics 2021;37:1658-1665. [PMID: 33471066 DOI: 10.1093/bioinformatics/btab014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2020] [Revised: 12/28/2020] [Accepted: 01/06/2021] [Indexed: 11/12/2022] Open

Abstract

MOTIVATION

Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.

RESULTS

We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It supports natively eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprises novel aspects of interest. Namely, (a) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. About this, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.

AVAILABILITY

The software and the datasets are available at https://github.com/fpalini/fade.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Gemović B, Perović V, Davidović R, Drljača T, Veljkovic N. Alignment-free method for functional annotation of amino acid substitutions: Application on epigenetic factors involved in hematologic malignancies. PLoS One 2021;16:e0244948. [PMID: 33395407 PMCID: PMC7781373 DOI: 10.1371/journal.pone.0244948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2020] [Accepted: 12/21/2020] [Indexed: 11/19/2022] Open

Inferring Phylogenomic Relationship of Microbes Using Scalable Alignment-Free Methods. Methods Mol Biol 2021;2242:69-76. [PMID: 33961218 DOI: 10.1007/978-1-0716-1099-2_5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]

Almeida JR, Pratas D, Oliveira JL. A semi-automatic methodology for analysing distributed and private biobanks. Comput Biol Med 2020;130:104180. [PMID: 33360272 DOI: 10.1016/j.compbiomed.2020.104180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2020] [Revised: 12/14/2020] [Accepted: 12/14/2020] [Indexed: 10/22/2022]

Song K. Classifying the Lifestyle of Metagenomically-Derived Phages Sequences Using Alignment-Free Methods. Front Microbiol 2020;11:567769. [PMID: 33304326 PMCID: PMC7693541 DOI: 10.3389/fmicb.2020.567769] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2020] [Accepted: 10/22/2020] [Indexed: 01/20/2023] Open

Horizontal Gene Transfer in Eukaryotes: Not if, but How Much? Trends Genet 2020;36:915-925. [DOI: 10.1016/j.tig.2020.08.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Revised: 07/31/2020] [Accepted: 08/10/2020] [Indexed: 12/17/2022]

Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020;9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 10/12/2020] [Indexed: 12/29/2022] Open

Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020;8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022] Open

Huang J, Dai Q, Yao Y, He PA. A Generalized Iterative Map for Analysis of Protein Sequences. Comb Chem High Throughput Screen 2020;25:381-391. [PMID: 33045963 DOI: 10.2174/1386207323666201012142318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2020] [Revised: 07/30/2020] [Accepted: 08/09/2020] [Indexed: 11/22/2022]

Abstract

AIM AND OBJECTIVE

The similarities comparison of biological sequences is an important task in bioinformatics. The methods of the similarities comparison for biological sequences are divided into two classes: sequence alignment method and alignment-free method. The graphical representation of biological sequences is a kind of alignment-free method, which constitutes a tool for analyzing and visualizing the biological sequences. In this article, a generalized iterative map of protein sequences was suggested to analyze the similarities of biological sequences.

MATERIALS AND METHODS

Based on the normalized physicochemical indexes of 20 amino acids, each amino acid can be mapped into a point in 5D space. A generalized iterative function system was introduced to outline a generalized iterative map of protein sequences, which can not only reflect various physicochemical properties of amino acids but also incorporate with different compression ratios of the component of a generalized iterative map. Several properties were proved to illustrate the advantage of the generalized iterative map. The mathematical description of the generalized iterative map was suggested to compare the similarities and dissimilarities of protein sequences. Based on this method, similarities/dissimilarities were compared among ND5 protein sequences, as well as ND6 protein sequences of ten different species.

RESULTS

Using correlation analysis, our similarity/dissimilarity results and those of other graphical representation methods were compared with the ClustalW results to show the utility of our approach. Our approach correlates better with ClustalW than the other approaches for all species, which illustrates its effectiveness.

CONCLUSION

The two examples show that our method performs well in the similarity/dissimilarity analysis of protein sequences and does not require complex computation.
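
The iterative map described under Materials and Methods can be illustrated with a minimal sketch. It assumes a chaos-game-style update in which each coordinate moves toward the current residue's point by its own compression ratio; the abstract does not give the actual update rule, the five physicochemical indexes, or the ratios, so the function name `iterative_map`, the toy 2D points, and the ratios below are hypothetical placeholders rather than the authors' values.

```python
from typing import Dict, List, Optional, Sequence


def iterative_map(
    seq: str,
    points: Dict[str, Sequence[float]],
    ratios: Sequence[float],
    start: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Trace a sequence through an iterated map with per-coordinate ratios.

    Assumed update rule (not taken from the paper):
        x_new[j] = x_old[j] + ratios[j] * (points[residue][j] - x_old[j])
    """
    dim = len(ratios)
    # Start at the centre of the unit cube unless a start point is given.
    current = list(start) if start is not None else [0.5] * dim
    trajectory: List[List[float]] = []
    for residue in seq:
        target = points[residue]  # KeyError for residues missing from the table
        current = [c + r * (t - c) for c, t, r in zip(current, target, ratios)]
        trajectory.append(current)
    return trajectory


# Toy, made-up 2D coordinates for three residues (placeholders only).
TOY_POINTS = {"A": [0.0, 0.0], "G": [1.0, 0.0], "L": [0.0, 1.0]}
TOY_RATIOS = [0.5, 0.25]  # one compression ratio per coordinate

trajectory = iterative_map("AGL", TOY_POINTS, TOY_RATIOS)
print(trajectory[-1])
```

With a real table, `points` would hold one normalized 5D vector for each of the 20 amino acids, and the resulting trajectories of two proteins could then be summarized and compared.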


Klötzl F, Haubold B. Phylonium: fast estimation of evolutionary distances from large samples of similar genomes. Bioinformatics 2020;36:2040-2046. [PMID: 31790149 PMCID: PMC7141870 DOI: 10.1093/bioinformatics/btz903] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/06/2019] [Revised: 11/01/2019] [Accepted: 11/28/2019] [Indexed: 11/13/2022] Open

Nugent CM, Adamowicz SJ. Alignment-free classification of COI DNA barcode data with the Python package Alfie. Metabarcoding and Metagenomics 2020. [DOI: 10.3897/mbmg.4.55815] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023] Open

Delibaş E, Arslan A, Şeker A, Diri B. A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up. J Mol Graph Model 2020;100:107693. [PMID: 32805559 DOI: 10.1016/j.jmgm.2020.107693] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/12/2020] [Revised: 06/15/2020] [Accepted: 07/06/2020] [Indexed: 11/17/2022]

Pratas D, Toppinen M, Pyöriä L, Hedman K, Sajantila A, Perdomo MF. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. Gigascience 2020;9:giaa086. [PMID: 32815536 PMCID: PMC7439602 DOI: 10.1093/gigascience/giaa086] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 01/17/2020] [Revised: 05/25/2020] [Accepted: 07/23/2020] [Indexed: 12/21/2022] Open

Libkind D, Čadež N, Opulente DA, Langdon QK, Rosa CA, Sampaio JP, Gonçalves P, Hittinger CT, Lachance MA. Towards yeast taxogenomics: lessons from novel species descriptions based on complete genome sequences. FEMS Yeast Res 2020;20:5876348. [DOI: 10.1093/femsyr/foaa042] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/10/2020] [Accepted: 07/23/2020] [Indexed: 01/23/2023] Open

Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Mol Ecol 2020;29:2521-2534. [PMID: 32542933 PMCID: PMC7496323 DOI: 10.1111/mec.15507] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 02/26/2020] [Revised: 06/03/2020] [Accepted: 06/05/2020] [Indexed: 02/06/2023]

ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci 2020;12:276-287. [PMID: 32524529 DOI: 10.1007/s12539-020-00380-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2019] [Revised: 05/19/2020] [Accepted: 06/02/2020] [Indexed: 10/24/2022]

Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel) 2020;22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]

Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020;21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open

Acman M, van Dorp L, Santini JM, Balloux F. Large-scale network analysis captures biological features of bacterial plasmids. Nat Commun 2020;11:2452. [PMID: 32415210 PMCID: PMC7229196 DOI: 10.1038/s41467-020-16282-w] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 10/03/2019] [Accepted: 04/23/2020] [Indexed: 11/30/2022] Open

Hosseini M, Pratas D, Morgenstern B, Pinho AJ. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. Gigascience 2020;9:giaa048. [PMID: 32432328 PMCID: PMC7238676 DOI: 10.1093/gigascience/giaa048] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/11/2020] [Revised: 04/06/2020] [Accepted: 04/20/2020] [Indexed: 12/05/2022] Open

Smirnov V, Warnow T. Unblended disjoint tree merging using GTM improves species tree estimation. BMC Genomics 2020;21:235. [PMID: 32299343 PMCID: PMC7161100 DOI: 10.1186/s12864-020-6605-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022] Open

Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 2020;52:534-540. [PMID: 32284578 PMCID: PMC7610390 DOI: 10.1038/s41588-020-0612-7] [Citation(s) in RCA: 72] [Impact Index Per Article: 18.0] [Received: 10/31/2019] [Accepted: 03/10/2020] [Indexed: 12/11/2022]

Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020;2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open

Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020;15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open


Whole-proteome tree of life suggests a deep burst of organism diversity. Proc Natl Acad Sci U S A 2020;117:3678-3686. [PMID: 32019884 PMCID: PMC7035600 DOI: 10.1073/pnas.1915766117] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/01/2023] Open

Abstract

Tree of life (ToL) is a metaphorical tree that captures a simplified narrative of the evolutionary course and kinship among all living organisms of today. We have reconstructed a whole-proteome ToL for over 4,000 different extant species for which complete or near-complete genome sequences are available in public databases. The ToL suggests that 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all the groups may be assigned on an evolutionary progression scale; and 3) all of the founders of the groups have emerged in a “deep burst” near the root of the ToL—an explosive birth of life’s diversity.

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of the whole genome, the whole transcriptome, and the whole proteome. Of the three, the whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying the information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences, focusing on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals that 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; and 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning, near the root of the ToL, an explosive birth of life’s diversity.
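
The feature frequency profile idea mentioned in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes a profile is the relative frequency of every overlapping k-mer in a sequence and that two profiles are compared with the Jensen-Shannon divergence (base 2), which the abstract does not spell out. The function names and the feature length `k` are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict


def feature_frequency_profile(seq: str, k: int) -> Dict[str, float]:
    """Relative frequencies of all overlapping k-mers in seq."""
    if k <= 0 or len(seq) < k:
        raise ValueError("sequence must contain at least one k-mer")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits; 0 for identical profiles, at most 1."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi = p.get(kmer, 0.0)
        qi = q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution; positive for every kmer here
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd


# Toy protein fragments (made up for illustration only).
profile_a = feature_frequency_profile("MKVLAAGIVGL", 2)
profile_b = feature_frequency_profile("MKVLSAGLVGL", 2)
print(jensen_shannon_divergence(profile_a, profile_b))
```

A matrix of such pairwise divergences over whole proteomes could then be passed to a distance-based tree-building method; the choice of feature length and tree method is left open here.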
