1
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
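As a concrete illustration of the terms used in this abstract, the sketch below counts k-mers in a sequence and lists the k-mers that never occur in it (the idea behind nullomers). It is a minimal example and not a method taken from the review: the function names are hypothetical, only one strand is considered, and real nullomer analyses are defined relative to whole genomes or proteomes.

```python
from collections import Counter
from itertools import product


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def absent_kmers(seq: str, k: int, alphabet: str = "ACGT") -> list[str]:
    """Return all k-mers over the alphabet that do not occur in seq."""
    present = count_kmers(seq, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in present)


if __name__ == "__main__":
    counts = count_kmers("ACGTAC", 2)
    print(counts)                             # 2-mer counts for the toy sequence
    print(len(absent_kmers("ACGTAC", 2)))     # number of 2-mers never observed
```

Storing counts in a hash map keeps the pass over the sequence linear in its length, which is one reason k-mer methods scale to large datasets.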
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2
Wang B, Jin Y, Hu M, Zhao Y, Wang X, Yue J, Ren H. Detecting genetic gain and loss events in terms of protein domain: Method and implementation. Heliyon 2024; 10:e32103. [PMID: 38867972 PMCID: PMC11168390 DOI: 10.1016/j.heliyon.2024.e32103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 05/08/2024] [Accepted: 05/28/2024] [Indexed: 06/14/2024] Open
Abstract
Continuous gain and loss of genes are the primary driving forces of bacterial evolution and environmental adaptation. Studying bacterial evolution in terms of protein domains, which are the fundamental functional and evolutionary units of proteins, can provide a more comprehensive understanding of bacterial differentiation and phenotypic adaptation processes. Therefore, we proposed a phylogenetic tree-based method for detecting genetic gain and loss events in terms of protein domains. Specifically, the method focuses on a single domain to trace its evolution process or on multiple domains to investigate their co-evolution principles. This novel method was validated using 122 Shigella isolates. We found that the loss of a significant number of domains was likely the main driving force behind the evolution of Shigella, which could reduce energy expenditure and preserve only the most essential functions. Additionally, we observed that simultaneously gained and lost domains were often functionally related, which can facilitate and accelerate phenotypic evolutionary adaptation to the environment. All results obtained using our method agree with those of previous studies, which validates our proposed method.
Affiliation(s)
- Boqian Wang
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Yuan Jin
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Mingda Hu
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Yunxiang Zhao
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Xin Wang
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Junjie Yue
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
- Hongguang Ren
- Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Beijing, China
3
Nagy NA, Tóth GE, Kurucz K, Kemenesi G, Laczkó L. The updated genome of the Hungarian population of Aedes koreicus. Sci Rep 2024; 14:7545. [PMID: 38555322 PMCID: PMC10981705 DOI: 10.1038/s41598-024-58096-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Accepted: 03/25/2024] [Indexed: 04/02/2024] Open
Abstract
Vector-borne diseases pose a potential risk to human and animal welfare, and understanding their spread requires genomic resources. The mosquito Aedes koreicus is an emerging vector that was introduced into Europe more than 15 years ago, but only a low-quality, fragmented genome was available. In this study, we carried out additional sequencing and assembled and characterized the genome of the species to provide a background for understanding its evolution and biology. The updated genome was 1.1 Gbp long and consisted of 6099 contigs with an N50 value of 329,610 bp and a BUSCO score of 84%. We identified 22,580 genes that could be functionally annotated and paid particular attention to the identification of potential insecticide resistance genes. The assessment of the orthology of the genes indicates a high turnover at the terminal branches of the species tree of mosquitoes with complete genomes, which could contribute to the adaptation and evolutionary success of the species. These results could form the basis for numerous downstream analyses to develop targets for the control of mosquito populations.
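The N50 value quoted in this abstract is a standard contiguity statistic: the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. The sketch below shows one common way to compute it; the function name and the toy contig lengths are illustrative and do not come from the paper.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first covers at least half
    of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy assembly: total length 54, so the halfway point is 27.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Comparing `running * 2` against `total` avoids floating-point division when the total length is odd.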
Affiliation(s)
- Nikoletta Andrea Nagy
- Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary.
- HUN-REN-UD Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary.
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary.
- Gábor Endre Tóth
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Bernhard Nocht Institute for Tropical Medicine, WHO Collaborating Centre for Arbovirus and Hemorrhagic Fever Reference and Research, Hamburg, Germany
- Kornélia Kurucz
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Institute of Biology, Faculty of Sciences, University of Pécs, Pecs, Hungary
- Gábor Kemenesi
- National Laboratory of Virology, Szentágothai Research Centre, University of Pécs, Pecs, Hungary
- Institute of Biology, Faculty of Sciences, University of Pécs, Pecs, Hungary
- Levente Laczkó
- HUN-REN-UD Conservation Biology Research Group, University of Debrecen, Debrecen, Hungary
- One Health Institute, University of Debrecen, Debrecen, Hungary
4
Wang F, Wang Y, Zeng X, Zhang S, Yu J, Li D, Zhang X. MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction. Bioinformatics 2024; 40:btae154. [PMID: 38547397 PMCID: PMC10990684 DOI: 10.1093/bioinformatics/btae154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 02/06/2024] [Indexed: 04/05/2024] Open
Abstract
MOTIVATION: Constructing a phylogenetic tree requires calculating the evolutionary distance between samples or species via large-scale resequencing data, a process that is both time-consuming and computationally demanding. Striking the right balance between accuracy and efficiency is a significant challenge. RESULTS: To address this, we introduce a new algorithm, MIKE (MinHash-based k-mer algorithm). This algorithm is designed for the swift calculation of the Jaccard coefficient directly from raw sequencing reads and enables the construction of phylogenetic trees based on the resultant Jaccard coefficient. Simulation results highlight the superior speed of MIKE compared to existing state-of-the-art methods. We used MIKE to reconstruct a phylogenetic tree, incorporating 238 yeast, 303 Zea, 141 Ficus, 67 Oryza, and 43 Saccharum spontaneum samples. MIKE demonstrated accurate performance across varying evolutionary scales, reproductive modes, and ploidy levels, proving itself as a powerful tool for phylogenetic tree construction. AVAILABILITY AND IMPLEMENTATION: MIKE is publicly available on GitHub at https://github.com/Argonum-Clever2/mike.git.
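To make the idea of a MinHash-based Jaccard estimate concrete, the sketch below builds a bottom-s sketch (the s smallest hash values of a sequence's k-mers) and estimates the Jaccard coefficient from two such sketches. This is a generic textbook construction and is not MIKE's implementation: the function names, the choice of BLAKE2b as the hash, and the toy sequences are all assumptions made for illustration.

```python
import hashlib


def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer using BLAKE2b (an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq: str, k: int, s: int) -> list[int]:
    """Keep the s smallest hash values among the k-mers of seq."""
    return sorted({hash_kmer(x) for x in kmer_set(seq, k)})[:s]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], s: int) -> float:
    """Estimate the Jaccard coefficient from two bottom-s sketches.

    The s smallest hashes of the merged sketches act as a random sample
    of the union; the fraction found in both sketches is the estimate.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = bottom_sketch("ACGTACGT", k=3, s=100)
    b = bottom_sketch("ACGTTTTT", k=3, s=100)
    print(jaccard_estimate(a, b, s=100))
```

When the sketch size is at least the number of distinct k-mers, as in the toy call above, the estimate coincides with the exact Jaccard coefficient; smaller sketches trade accuracy for speed and memory.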
Affiliation(s)
- Fang Wang
- College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Yibin Wang
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Xiaofei Zeng
- Department of Human Cell Biology and Genetics, Joint Laboratory of Guangdong-Hong Kong Universities for Vascular Homeostasis and Diseases, School of Medicine, Southern University of Science and Technology, Shenzhen, Guangdong 508055, China
- Shengcheng Zhang
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Jiaxin Yu
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Dongxi Li
- College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China
- Xingtan Zhang
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
5
Mirarab S, Bafna V. Analyses of Nuclear Reads Obtained Using Genome Skimming. Methods Mol Biol 2024; 2744:247-265. [PMID: 38683324 DOI: 10.1007/978-1-0716-3581-0_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
In this protocol paper, we review a set of methods developed in recent years for analyzing nuclear reads obtained from genome skimming. As the cost of sequencing drops, genome skimming (low-coverage shotgun sequencing of a sample) becomes an increasingly cost-effective method of measuring biodiversity at high resolution. While most practitioners only use assembled over-represented organelle reads from a genome skim, the vast majority of the reads are nuclear. Using assembly-free and alignment-free methods described in this protocol, we can compare samples to each other and to reference genomes to compute distances, characterize underlying genomes, and infer evolutionary relationships.
Affiliation(s)
- Siavash Mirarab
- Electrical and Computer Engineering, University of California-San Diego, La Jolla, CA, USA.
- Vineet Bafna
- Computer Science and Engineering, University of California-San Diego, La Jolla, CA, USA
6
Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nat Methods 2023; 20:1661-1665. [PMID: 37735570 PMCID: PMC10630134 DOI: 10.1038/s41592-023-02018-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 08/22/2023] [Indexed: 09/23/2023]
Abstract
Sequence comparison tools for metagenome-assembled genomes (MAGs) struggle with high-volume or low-quality data. We present skani (https://github.com/bluenote-1577/skani), a method for determining average nucleotide identity (ANI) via sparse approximate alignments. skani outperforms FastANI in accuracy and speed (>20× faster) for fragmented, incomplete MAGs. skani can query genomes against >65,000 prokaryotic genomes in seconds using 6 GB of memory. skani unlocks higher-resolution insights for extensive, noisy metagenomic datasets.
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario, Canada.
- Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario, Canada.
- Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, Ontario, Canada.
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.
7
Fruzangohar M, Moolhuijzen P, Bakaj N, Taylor J. CoreDetector: a flexible and efficient program for core-genome alignment of evolutionary diverse genomes. Bioinformatics 2023; 39:btad628. [PMID: 37878789 PMCID: PMC10663985 DOI: 10.1093/bioinformatics/btad628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/20/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION: Whole genome alignment of eukaryote species remains an important method for the determination of sequence and structural variations and can also be used to ascertain the representative non-redundant core-genome sequence of a population. Many whole genome alignment tools were first developed for the more mature analysis of prokaryote species, and few current tools can process the larger genomes of eukaryotes or the genomes of more divergent species. In addition, these tools become computationally prohibitive because of the significant compute resources needed to handle larger genomes. RESULTS: In this research, we present CoreDetector, an easy-to-use general-purpose program that can align the core-genome sequences for a range of genome sizes and divergence levels. To illustrate the flexibility of CoreDetector, we conducted alignments of a large set of closely related fungal pathogen and hexaploid wheat cultivar genomes as well as more divergent fly and rodent species genomes. In all tested cases, compared to existing multiple genome alignment tools, CoreDetector exhibited improved flexibility and efficiency and competitive accuracy. AVAILABILITY AND IMPLEMENTATION: CoreDetector was developed in the cross-platform and easily deployable Java language. A packaged pipeline is readily executable in a bash terminal without any external need for Perl or Python environments. Installation, example data, and usage instructions for CoreDetector are freely available from https://github.com/mfruzan/CoreDetector.
Affiliation(s)
- Mario Fruzangohar
- The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Paula Moolhuijzen
- Centre for Crop Disease Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia 6102, Australia
- Nicolette Bakaj
- The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Julian Taylor
- The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
8
Bandaranayake PCG, Naranpanawa N, Chandrasekara CHWMRB, Samarakoon H, Lokuge S, Jayasundara S, Bandaranayake AU, Pushpakumara DKNG, Wijesundara DSA. Chloroplast genome, nuclear ITS regions, mitogenome regions, and Skmer analysis resolved the genetic relationship among Cinnamomum species in Sri Lanka. PLoS One 2023; 18:e0291763. [PMID: 37729154 PMCID: PMC10511092 DOI: 10.1371/journal.pone.0291763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Accepted: 09/05/2023] [Indexed: 09/22/2023] Open
Abstract
Cinnamomum species have gained worldwide attention because of their economic benefits. Among them, C. verum (synonymous with C. zeylanicum Blume), commonly known as Ceylon Cinnamon or True Cinnamon, is mainly produced in Sri Lanka. In addition, Sri Lanka is home to seven endemic wild cinnamon species, C. capparu-coronde, C. citriodorum, C. dubium, C. litseifolium, C. ovalifolium, C. rivulorum and C. sinharajaense. Proper identification and genetic characterization are fundamental for the conservation and commercialization of these species. While some species can be identified based on distinct morphological or chemical traits, others cannot be identified easily morphologically or chemically. DNA barcoding using the rbcL, matK, and trnH-psbA regions also could not resolve the identification of Cinnamomum species in Sri Lanka. Therefore, we generated Illumina HiSeq data of about 20x coverage for each identified species and a C. verum sample (India) and assembled the chloroplast genome, nuclear ITS regions, and several mitochondrial genes, and conducted Skmer analysis. Chloroplast genomes of all eight species were assembled using a seed-based method. According to the Bayesian phylogenomic tree constructed with the complete chloroplast genomes, C. verum (Sri Lanka) is sister to previously sequenced C. verum (NC_035236.1, KY635878.1), C. dubium and C. rivulorum. The C. verum sample from India is sister to C. litseifolium and C. ovalifolium. According to the ITS regions studied, C. verum (Sri Lanka) is sister to C. verum (NC_035236.1), C. dubium and C. rivulorum. Cinnamomum verum (India) shares an identical ITS region with C. ovalifolium, C. litseifolium, C. citriodorum, and C. capparu-coronde. According to the Skmer analysis, C. verum (Sri Lanka) is sister to C. dubium and C. rivulorum, whereas C. verum (India) is sister to C. ovalifolium and C. litseifolium.
The chloroplast gene ycf1 was identified as a chloroplast barcode for the identification of Cinnamomum species. We identified an 18 bp indel region in the ycf1 gene that could differentiate the C. verum (India) and C. verum (Sri Lanka) samples tested.
Affiliation(s)
- Nathasha Naranpanawa
- Faculty of Agriculture, Agricultural Biotechnology Centre, University of Peradeniya, Peradeniya, Sri Lanka
- Postgraduate Institute of Science, University of Peradeniya, Peradeniya, Sri Lanka
- Hiruna Samarakoon
- Faculty of Agriculture, Agricultural Biotechnology Centre, University of Peradeniya, Peradeniya, Sri Lanka
- S. Lokuge
- Faculty of Agriculture, Agricultural Biotechnology Centre, University of Peradeniya, Peradeniya, Sri Lanka
- S. Jayasundara
- Faculty of Agriculture, Agricultural Biotechnology Centre, University of Peradeniya, Peradeniya, Sri Lanka
- Asitha U. Bandaranayake
- Faculty of Engineering, Department of Computer Engineering, University of Peradeniya, Peradeniya, Sri Lanka
- D. K. N. G. Pushpakumara
- Faculty of Agriculture, Department of Crop Science, University of Peradeniya, Peradeniya, Sri Lanka
9
Mo ZQ, Wang J, Möller M, Yang JB, Gao LM. Phylogenetic Relationships and Next-Generation Barcodes in the Genus Torreya Reveal a High Proportion of Misidentified Cultivated Plants. Int J Mol Sci 2023; 24:13216. [PMID: 37686021 PMCID: PMC10487542 DOI: 10.3390/ijms241713216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 08/20/2023] [Accepted: 08/22/2023] [Indexed: 09/10/2023] Open
Abstract
Accurate species identification is key to conservation and phylogenetic inference. Living plant collections from botanical gardens and arboreta are important resources for scientific research, but the proportion of misidentified cultivated plants has not been tested using DNA barcodes. Here, we assembled the next-generation barcode (complete plastid genome and complete nrDNA cistron) and mitochondrial genes from genome skimming data of Torreya species, with multiple accessions for each species, to test species discrimination and the proportion of misidentified cultivated plants used in Torreya studies. A total of 38 accessions were included in the analyses, representing all nine recognized species of the genus Torreya. The plastid phylogeny showed that all 21 wild samples formed species-specific clades, except T. jiulongshanensis. Disregarding this putative hybrid, seven recognized species sampled here were successfully discriminated by the plastid genome. Only the T. nucifera accessions grouped into two grades. The species identification rate of the nrDNA cistron was 62.5%. The Skmer analysis based on nuclear reads from genome skims showed promise for species identification, with seven species discriminated. The proportion of misidentified cultivated plants from arboreta/botanical gardens was relatively high, with four accessions (23.5%) representing three species. Interspecific relationships within Torreya were fully resolved with maximum support by plastomes, where Torreya jackii was on the earliest diverging branch, though sister to T. grandis in the nrDNA cistron tree, suggesting that this is likely a hybrid species between T. grandis and an extinct Torreya ancestor lineage. The findings here provide quantitative insights into the usage of cultivated samples for phylogenetic study.
Affiliation(s)
- Zhi-Qiong Mo
- CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Jie Wang
- CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- Jun-Bo Yang
- Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- Lian-Ming Gao
- CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, China
- Lijiang Forest Biodiversity National Observation and Research Station, Kunming Institute of Botany, Chinese Academy of Sciences, Lijiang 674100, China
10
Pezzini FF, Ferrari G, Forrest LL, Hart ML, Nishii K, Kidner CA. Target capture and genome skimming for plant diversity studies. Applications in Plant Sciences 2023; 11:e11537. [PMID: 37601316 PMCID: PMC10439825 DOI: 10.1002/aps3.11537] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/17/2022] [Revised: 06/16/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023]
Abstract
Recent technological advances in long-read high-throughput sequencing and assembly methods have facilitated the generation of annotated chromosome-scale whole-genome sequence data for evolutionary studies; however, generating such data can still be difficult for many plant species. For example, obtaining high-molecular-weight DNA is typically impossible for samples in historical herbarium collections, which often have degraded DNA. The need to fast-freeze newly collected living samples to conserve high-quality DNA can be complicated when plants are only found in remote areas. Therefore, short-read reduced-genome representations, such as target capture and genome skimming, remain important for evolutionary studies. Here, we review the pros and cons of each technique for non-model plant taxa. We provide guidance related to logistics, budget, the genomic resources previously available for the target clade, and the nature of the study. Furthermore, we assess the available bioinformatic analyses, detailing best practices and pitfalls, and suggest pathways to combine newly generated data with legacy data. Finally, we explore the possible downstream analyses allowed by the type of data generated using each technique. We provide a practical guide to help researchers make the best-informed choice regarding reduced genome representation for evolutionary studies of non-model plants in cases where whole-genome sequencing remains impractical.
Affiliation(s)
- Giada Ferrari
- Royal Botanic Garden Edinburgh Edinburgh United Kingdom
- Kanae Nishii
- Royal Botanic Garden Edinburgh Edinburgh United Kingdom
- Catherine A Kidner
- Royal Botanic Garden Edinburgh Edinburgh United Kingdom
- School of Biological Sciences University of Edinburgh Edinburgh United Kingdom
11
Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Res 2023; 33:1175-1187. [PMID: 36990779 PMCID: PMC10538486 DOI: 10.1101/gr.277637.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2023] [Accepted: 03/16/2023] [Indexed: 03/31/2023]
Abstract
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(mn^{f(θ)} log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than [Formula: see text] fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.
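As a rough worked illustration of this runtime bound, read f(θ) as an exponent on n (the form consistent with the "almost O(m log n)" of the title) and plug in an assumed mutation rate of θ = 0.05. The value of θ is chosen here for illustration only and does not come from the paper.

```latex
\[
  f(\theta) < 2.43\,\theta = 2.43 \times 0.05 = 0.1215
  \quad\Longrightarrow\quad
  O\!\left(m\, n^{f(\theta)} \log n\right) \subseteq O\!\left(m\, n^{0.1215} \log n\right).
\]
```

At the upper limit of the stated range, θ = 0.206, the loose bound on the exponent is 2.43 × 0.206 ≈ 0.50, so under this reading the extra factor over m log n grows no faster than roughly the square root of n.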
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada;
- Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario M5S 2E4, Canada
- Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, Ontario M1C 1A4, Canada
12
Pouchon C, Boluda CG. REFMAKER: make your own reference to target nuclear loci in low coverage genome skimming libraries. Phylogenomic application in Sapotaceae. Mol Phylogenet Evol 2023:107826. [PMID: 37257798 DOI: 10.1016/j.ympev.2023.107826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Revised: 04/24/2023] [Accepted: 05/25/2023] [Indexed: 06/02/2023]
Abstract
The genome skimming approach is widely used in plant systematics to infer phylogenies, mostly from organelle genomes. However, organelles represent only 10% of the produced libraries, and the low coverage associated with these libraries (< 3X) prevents the capture of nuclear sequences, which are not always available in non-model organisms or are limited to the ribosomal regions. We developed REFMAKER, a user-friendly pipeline, to create specific sets of nuclear loci that can then be extracted directly from the genome skimming libraries. For this, a catalogue is built from the meta-assembly of each library's contigs and cleaned by selecting the nuclear regions and removing duplicates through clustering steps. Libraries are then mapped onto this catalogue, and consensus sequences are generated to produce a ready-to-use phylogenetic matrix following different filtering parameters aimed at removing putative errors and paralogous sequences. REFMAKER allowed us to infer a well-resolved phylogeny in Capurodendron (Sapotaceae) on 67 nuclear loci from low-coverage libraries (<1X). The resulting phylogeny is consistent with one previously inferred on 638 nuclear genes from target enrichment libraries. While it remains preliminary because of this low sequencing depth, REFMAKER therefore opens perspectives in phylogenomics by allowing nuclear phylogeny reconstructions with genome skimming datasets.
Affiliation(s)
- Charles Pouchon
- Conservatoire et Jardin botaniques de la Ville de Genève, Chemin de l'Impératrice 1, 1292 Chambésy, Geneva, Switzerland; PhyloLab, Department of Plant Sciences, Université de Genève, Chemin de l'Impératrice 1, 1292 Chambésy, Geneva, Switzerland.
| | - Carlos G Boluda
- Conservatoire et Jardin botaniques de la Ville de Genève, Chemin de l'Impératrice 1, 1292 Chambésy, Geneva, Switzerland; PhyloLab, Department of Plant Sciences, Université de Genève, Chemin de l'Impératrice 1, 1292 Chambésy, Geneva, Switzerland
| |
|
13
|
Paula DP, Andow DA. DNA High-Throughput Sequencing for Arthropod Gut Content Analysis to Evaluate Effectiveness and Safety of Biological Control Agents. Neotropical Entomology 2023; 52:302-332. [PMID: 36478343 DOI: 10.1007/s13744-022-01011-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Accepted: 11/20/2022] [Indexed: 06/17/2023]
Abstract
The search for effective biological control agents without harmful non-target effects has been constrained by the use of impractical (field direct observation) or imprecise (cage experiments) methods. While advances in the DNA sequencing methods, more specifically the development of high-throughput sequencing (HTS), have been quickly incorporated in biodiversity surveys, they have been slow to be adopted to determine arthropod prey range, predation rate and food web structure, and critical information to evaluate the effectiveness and safety of a biological control agent candidate. The lack of knowledge on how HTS methods could be applied by ecological entomologists constitutes part of the problem, although the lack of expertise and the high cost of the analysis also are important limiting factors. In this review, we describe how the latest HTS methods of metabarcoding and Lazaro, a method to identify prey by mapping unassembled shotgun reads, can serve biological control research, showing both their power and limitations. We explain how they work to determine prey range and also how their data can be used to estimate predation rates and subsequently be translated into food webs of natural enemy and prey populations helping to elucidate their role in the community. We present a brief history of prey detection through molecular gut content analysis and also the attempts to develop a more precise formula to estimate predation rates, a problem that still remains. We focused on arthropods in agricultural ecosystems, but most of what is covered here can be applied to natural systems and non-arthropod biological control candidates as well.
|
14
|
Raiyemo DA, Bobadilla LK, Tranel PJ. Genomic profiling of dioecious Amaranthus species provides novel insights into species relatedness and sex genes. BMC Biol 2023; 21:37. [PMID: 36804015 PMCID: PMC9940365 DOI: 10.1186/s12915-023-01539-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/21/2022] [Accepted: 02/08/2023] [Indexed: 02/21/2023] Open
Abstract
BACKGROUND: Amaranthus L. is a diverse genus consisting of domesticated, weedy, and non-invasive species distributed around the world. Nine species are dioecious, of which Amaranthus palmeri S. Watson and Amaranthus tuberculatus (Moq.) J.D. Sauer are troublesome weeds of agronomic crops in the USA and elsewhere. Shallow relationships among the dioecious Amaranthus species and the conservation of candidate genes within previously identified A. palmeri and A. tuberculatus male-specific regions of the Y (MSYs) in other dioecious species are poorly understood. In this study, seven genomes of dioecious amaranths were obtained by paired-end short-read sequencing and combined with short reads of seventeen species in the family Amaranthaceae from NCBI database. The species were phylogenomically analyzed to understand their relatedness. Genome characteristics for the dioecious species were evaluated and coverage analysis was used to investigate the conservation of sequences within the MSY regions. RESULTS: We provide genome size, heterozygosity, and ploidy level inference for seven newly sequenced dioecious Amaranthus species and two additional dioecious species from the NCBI database. We report a pattern of transposable element proliferation in the species, in which seven species had more Ty3 elements than copia elements while A. palmeri and A. watsonii had more copia elements than Ty3 elements, similar to the TE pattern in some monoecious amaranths. Using a Mash-based phylogenomic analysis, we accurately recovered taxonomic relationships among the dioecious Amaranthus species that were previously identified based on comparative morphology. Coverage analysis revealed eleven candidate gene models within the A. palmeri MSY region with male-enriched coverages, as well as regions on scaffold 19 with female-enriched coverage, based on A. watsonii read alignments. A previously reported FLOWERING LOCUS T (FT) within A. tuberculatus MSY contig was also found to exhibit male-enriched coverages for three species closely related to A. tuberculatus but not for A. watsonii reads. Additional characterization of the A. palmeri MSY region revealed that 78% of the region is made of repetitive elements, typical of a sex determination region with reduced recombination. CONCLUSIONS: The results of this study further increase our understanding of the relationships among the dioecious species of the Amaranthus genus as well as revealed genes with potential roles in sex function in the species.
Affiliation(s)
- Damilola A Raiyemo
- Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
| | - Lucas K Bobadilla
- Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
| | - Patrick J Tranel
- Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA.
| |
|
15
|
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well established approach of sequence analysis and comparison, though it is time and memory consuming. As the scale of sequencing data is increasing day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationship among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as Fish mtDNA, E.coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied on several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
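The idea described in this abstract, a cosine distance between vectors derived from minimal absent words, can be illustrated with a small sketch. This is not the CD-MAWS implementation: the abstract does not specify how the composition vectors are built, so the sketch assumes a simple binary indicator vector over MAWs, and the function names, the brute-force enumeration, and the `max_len` cutoff are all hypothetical choices made for illustration.

```python
import math


def substrings(seq, length):
    """All distinct substrings of the given length."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def minimal_absent_words(seq, max_len=4, alphabet="ACGT"):
    """Brute-force MAWs of length 2..max_len (assumed cutoff).

    A word w is a minimal absent word of seq if w does not occur in seq
    but both w[:-1] and w[1:] do.
    """
    present = {n: substrings(seq, n) for n in range(1, max_len + 1)}
    maws = set()
    for n in range(2, max_len + 1):
        for prefix in present[n - 1]:          # w[:-1] occurs by construction
            for letter in alphabet:
                word = prefix + letter
                if word not in present[n] and word[1:] in present[n - 1]:
                    maws.add(word)
    return maws


def cosine_distance(maws_a, maws_b):
    """1 - cosine similarity of binary indicator vectors (an assumption)."""
    if not maws_a or not maws_b:
        return 1.0
    shared = len(maws_a & maws_b)
    return 1.0 - shared / math.sqrt(len(maws_a) * len(maws_b))


if __name__ == "__main__":
    a = minimal_absent_words("ACGTACGGTCA")
    b = minimal_absent_words("ACGTTCGGTCA")
    print(cosine_distance(a, b))
```

A pairwise matrix of such distances could then be passed to a standard distance-based tree builder such as neighbor joining; real genomes would need a linear-time MAW algorithm rather than this enumeration.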
|
16
|
Rachtman E, Sarmashghi S, Bafna V, Mirarab S. Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling. Cell Syst 2022; 13:817-829.e3. [PMID: 36265468 PMCID: PMC9589918 DOI: 10.1016/j.cels.2022.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Revised: 03/14/2022] [Accepted: 06/28/2022] [Indexed: 01/26/2023]
Abstract
Computing distance between two genomes without alignments or even access to assemblies has many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, San Diego, CA 92093, USA
| | - Shahab Sarmashghi
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, San Diego, CA 92093, USA.
| |
|
17
|
Balaban M, Bristy NA, Faisal A, Bayzid MS, Mirarab S. Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. Bioinformatics Advances 2022; 2:vbac055. [PMID: 35992043 PMCID: PMC9383262 DOI: 10.1093/bioadv/vbac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/21/2022] [Accepted: 08/09/2022] [Indexed: 01/27/2023]
Abstract
While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Availability and implementation: Our software is available open source at https://github.com/nishatbristy007/NSB. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
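The core operation named in this abstract, replacing letters and recomputing Jaccard indices between k-mer sets, can be sketched as follows. This is only an illustration of that one step, not the TK4 estimator or the NSB software: the abstract does not say which replacements are used, so the `{"A": "T"}` mapping is a hypothetical example, and the sketch ignores reverse complements and the random-match correction that the authors mention for larger genomes.

```python
def kmer_set(seq, k):
    """Set of all k-mers in seq (forward strand only, a simplification)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replace_letters(seq, mapping):
    """Rewrite seq with some letters replaced, e.g. {"A": "T"}."""
    return seq.translate(str.maketrans(mapping))


if __name__ == "__main__":
    s1, s2, k = "ACGTACGT", "ACGTTCGT", 3
    before = jaccard(kmer_set(s1, k), kmer_set(s2, k))

    # Hypothetical replacement: treat A and T as the same letter.
    merge = {"A": "T"}
    after = jaccard(kmer_set(replace_letters(s1, merge), k),
                    kmer_set(replace_letters(s2, merge), k))
    print(before, after)
```

In this toy pair the only difference is an A/T substitution, so merging A into T makes the k-mer sets identical. The change in Jaccard index under a given replacement therefore carries information about how often that substitution type occurred.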
Affiliation(s)
| | | | - Ahnaf Faisal
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | | |
|
18
|
Paula DP, Timbó RV, Togawa RC, Vogler AP, Andow DA. Quantitative prey species detection in predator guts across multiple trophic levels by mapping unassembled shotgun reads. Mol Ecol Resour 2022; 23:64-80. [DOI: 10.1111/1755-0998.13690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Revised: 06/11/2022] [Accepted: 07/05/2022] [Indexed: 11/29/2022]
Affiliation(s)
- Débora P. Paula
- Embrapa Recursos Genéticos e Biotecnologia, Brasília, DF, Brazil
| | - Renata V. Timbó
- Embrapa Recursos Genéticos e Biotecnologia, Brasília, DF, Brazil
- Universidade de Brasília, Campus Universitário Darcy Ribeiro, Brasília, DF, Brazil
| | | | - Alfried P. Vogler
- Imperial College London, Ascot, UK
- Department of Life Sciences, Natural History Museum, London, UK
| | - David A. Andow
- Department of Entomology, University of Minnesota, St. Paul, USA
| |
|
19
|
Xu T, Kong L, Li Q. Testing Efficacy of Assembly-Free and Alignment-Free Methods for Species Identification Using Genome Skims, with Patellogastropoda as a Test Case. Genes (Basel) 2022; 13:genes13071192. [PMID: 35885975 PMCID: PMC9318368 DOI: 10.3390/genes13071192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2022] [Revised: 06/26/2022] [Accepted: 06/28/2022] [Indexed: 02/05/2023] Open
Abstract
Most recently, species identification has leaped from DNA barcoding into shotgun sequencing-based “genome skimming” alternatives. Genome skims have mainly been used to assemble organelle genomes, which discards much of the nuclear genome. Recently, an alternative approach was proposed for sample identification, using unassembled genome skims, which can effectively improve phylogenetic signal and identification resolution. Studies have shown that the software Skmer and APPLES work well at estimating genomic distance and performing phylogenetic placement in birds and insects using low-coverage genome skims. In this study, we use Skmer and APPLES based on genome skims of 11 patellogastropods to perform assembly-free and alignment-free species identification and phylogenetic placement. Whether or not data corresponding to query species are present in the reference database, Skmer selects the best matching or closest species with COI barcodes under different sizes of genome skims except lacking species belonging to the same family as a query. APPLES cannot place patellogastropods in the correct phylogenetic position when the reference database is sparse. Our study represents the first attempt at assembly-free and alignment-free species identification of marine mollusks using genome skims, demonstrating its feasibility for patellogastropod species identification and flanking the necessity of establishing a database to share genome skims.
Affiliation(s)
- Tao Xu
- Key Laboratory of Mariculture, Ministry of Education, Ocean University of China, 5 Yushan Road, Qingdao 266003, China; (T.X.); (Q.L.)
| | - Lingfeng Kong
- Key Laboratory of Mariculture, Ministry of Education, Ocean University of China, 5 Yushan Road, Qingdao 266003, China; (T.X.); (Q.L.)
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, 5 Yushan Road, Qingdao 266003, China
| | - Qi Li
- Key Laboratory of Mariculture, Ministry of Education, Ocean University of China, 5 Yushan Road, Qingdao 266003, China; (T.X.); (Q.L.)
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, 5 Yushan Road, Qingdao 266003, China
| |
|
20
|
Schmidt A, Schneider C, Decker P, Hohberg K, Römbke J, Lehmitz R, Bálint M. Shotgun metagenomics of soil invertebrate communities reflects taxonomy, biomass, and reference genome properties. Ecol Evol 2022; 12:e8991. [PMID: 35784064 PMCID: PMC9170594 DOI: 10.1002/ece3.8991] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/27/2021] [Revised: 05/11/2022] [Accepted: 05/17/2022] [Indexed: 12/03/2022] Open
Abstract
Metagenomics - shotgun sequencing of all DNA fragments from a community DNA extract - is routinely used to describe the composition, structure, and function of microorganism communities. Advances in DNA sequencing and the availability of genome databases increasingly allow the use of shotgun metagenomics on eukaryotic communities. Metagenomics offers major advances in the recovery of biomass relationships in a sample, in comparison to taxonomic marker gene-based approaches (metabarcoding). However, little is known about the factors which influence metagenomics data from eukaryotic communities, such as differences among organism groups, the properties of reference genomes, and genome assemblies. We evaluated how shotgun metagenomics records composition and biomass in artificial soil invertebrate communities at different sequencing efforts. We generated mock communities of controlled biomass ratios from 28 species from all major soil mesofauna groups: mites, springtails, nematodes, tardigrades, and potworms. We shotgun sequenced these communities and taxonomically assigned them with a database of over 270 soil invertebrate genomes. We recovered over 95% of the species, and observed relatively high false-positive detection rates. We found strong differences in reads assigned to different taxa, with some groups (e.g., springtails) consistently attracting more hits than others (e.g., enchytraeids). Original biomass could be predicted from read counts after considering these taxon-specific differences. Species with larger genomes, and with more complete assemblies, consistently attracted more reads than species with smaller genomes. The GC content of the genome assemblies had no effect on the biomass-read relationships. Results were similar among different sequencing efforts. The results show considerable differences in taxon recovery and taxon specificity of biomass recovery from metagenomic sequence data. The properties of reference genomes and genome assemblies also influence biomass recovery, and they should be considered in metagenomic studies of eukaryotes. We show that low- and high-sequencing efforts yield similar results, suggesting high cost-efficiency of metagenomics for eukaryotic communities. We provide a brief roadmap for investigating factors which influence metagenomics-based eukaryotic community reconstructions. Understanding these factors is timely as accessibility of DNA sequencing and momentum for reference genomes projects show a future where the taxonomic assignment of DNA from any community sample becomes a reality.
Affiliation(s)
- Alexandra Schmidt
- Senckenberg Biodiversity Climate Research Center, Frankfurt am Main, Germany
- Biology Department, J.W. Goethe University, Frankfurt am Main, Germany
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Limnological Institute (Environmental Genomics), University of Konstanz, Konstanz, Germany
| | - Clément Schneider
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Soil Zoology Department, Senckenberg Museum of Natural History Görlitz, Görlitz, Germany
| | - Peter Decker
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Blumenstr. 5, Görlitz, Germany
| | - Karin Hohberg
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Soil Zoology Department, Senckenberg Museum of Natural History Görlitz, Görlitz, Germany
| | - Jörg Römbke
- ECT Oekotoxikologie GmbH, Flörsheim am Main, Germany
| | - Ricarda Lehmitz
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Soil Zoology Department, Senckenberg Museum of Natural History Görlitz, Görlitz, Germany
| | - Miklós Bálint
- Senckenberg Biodiversity Climate Research Center, Frankfurt am Main, Germany
- Loewe Center for Translational Biodiversity Genomics (LOEWE‐TBG), Frankfurt am Main, Germany
- Institute for Insect Biotechnology, Justus Liebig University, Gießen, Germany
| |
|
21
|
Belbasi M, Blanca A, Harris RS, Koslicki D, Medvedev P. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics 2022; 38:i169-i176. [PMID: 35758786 PMCID: PMC9235516 DOI: 10.1093/bioinformatics/btac244] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/30/2022] Open
Abstract
Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. Results: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. Availability and implementation: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. Supplementary information: Supplementary data are available at Bioinformatics online.
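The estimator discussed in this abstract can be compared with the true k-mer Jaccard in a short sketch. It assumes the common definition of a minimizer sketch: in every window of `w` consecutive k-mers, keep the k-mer with the smallest hash. The hash function (BLAKE2b truncated to 8 bytes), the function names, and the toy sequences are illustrative choices and are not taken from the paper's scripts.

```python
import hashlib


def kmers(seq, k):
    """List of k-mers in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minimizer_set(seq, k, w):
    """K-mers with the smallest hash in each window of w consecutive k-mers."""
    km = kmers(seq, k)
    hashes = [kmer_hash(x) for x in km]
    selected = set()
    for start in range(len(km) - w + 1):   # no windows if seq is too short
        best = min(range(start, start + w), key=lambda i: hashes[i])
        selected.add(km[best])
    return selected


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def true_jaccard(s1, s2, k):
    return jaccard(set(kmers(s1, k)), set(kmers(s2, k)))


def minimizer_jaccard(s1, s2, k, w):
    return jaccard(minimizer_set(s1, k, w), minimizer_set(s2, k, w))


if __name__ == "__main__":
    s1 = "ACGTTGCATGCCGATAGCTAGCTAGGCTAACGT"
    s2 = "ACGTTGCATGACGATAGCTAGCTTGGCTAACGT"
    print(true_jaccard(s1, s2, 5), minimizer_jaccard(s1, s2, 5, 4))
```

With `w = 1` every k-mer is its own minimizer, so the two quantities coincide. For larger `w` they generally differ, and the gap depends on where along the sequences the shared k-mers sit, which is the dependence the abstract describes.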
Affiliation(s)
- Mahdi Belbasi
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Antonio Blanca
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
| | - Robert S Harris
- Department of Biology, The Pennsylvania State University, University Park, PA, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA; Department of Biology, The Pennsylvania State University, University Park, PA, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA; Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
| |
|
22
|
Liu S, Koslicki D. CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 2022; 38:i28-i35. [PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
Abstract
Motivation: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information: Supplementary data are available at Bioinformatics online.
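The truncation idea in this abstract can be shown on plain Python sets. This sketch is not CMash: it uses neither the bottom-m sketch nor the ternary search tree, and the function names and example sequence are hypothetical. It only shows that prefixes of the kmax-mers recover almost all shorter k-mers, missing those that start in the last kmax - k positions, which is one visible source of the truncation bias the authors analyse.

```python
def kmer_set(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmers, k):
    """Derive k-mers from longer ones by keeping the length-k prefix."""
    return {x[:k] for x in kmers}


def containment(a, b):
    """Containment index |A ∩ B| / |A| of A in B."""
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    seq, kmax = "ACGTACGTTGCA", 6
    stored = kmer_set(seq, kmax)            # built once at the largest k

    for k in range(3, kmax + 1):
        approx = truncate(stored, k)        # no second pass over seq
        exact = kmer_set(seq, k)
        print(k, sorted(exact - approx), containment(exact, approx))
```

For a genome-scale sequence the handful of k-mers lost at the end is a vanishing fraction of the set, which fits the abstract's statement that the biases are negligible in practice.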
Affiliation(s)
- Shaopeng Liu
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
| | - David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA; Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA; Department of Biology, Pennsylvania State University, State College, PA 16801, USA
| |
|
23
|
Cay SB, Cinar YU, Kuralay SC, Inal B, Zararsiz G, Ciftci A, Mollman R, Obut O, Eldem V, Bakir Y, Erol O. Genome skimming approach reveals the gene arrangements in the chloroplast genomes of the highly endangered Crocus L. species: Crocus istanbulensis (B.Mathew) Rukšāns. PLoS One 2022; 17:e0269747. [PMID: 35704623 PMCID: PMC9200356 DOI: 10.1371/journal.pone.0269747] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/17/2021] [Accepted: 05/27/2022] [Indexed: 11/19/2022] Open
Abstract
Crocus istanbulensis (B.Mathew) Rukšāns is one of the most endangered Crocus species in the world and has an extremely limited distribution range in Istanbul. Our recent field work indicates that no more than one hundred individuals remain in the wild. In the present study, we used genome skimming to determine the complete chloroplast (cp) genome sequences of six C. istanbulensis individuals collected from the locus classicus. The cp genome of C. istanbulensis has 151,199 base pairs (bp), with a large single-copy (LSC) (81,197 bp), small single copy (SSC) (17,524 bp) and two inverted repeat (IR) regions of 26,236 bp each. The cp genome contains 132 genes, of which 86 are protein-coding (PCGs), 8 are rRNA and 38 are tRNA genes. Most of the repeats are found in intergenic spacers of Crocus species. Mononucleotide repeats were most abundant, accounting for over 80% of total repeats. The cp genome contained four palindrome repeats and one forward repeat. Comparative analyses among other Iridaceae species identified one inversion in the terminal positions of LSC region and three different gene (psbA, rps3 and rpl22) arrangements in C. istanbulensis that were not reported previously. To measure selective pressure in the exons of chloroplast coding sequences, we performed a sequence analysis of plastome-encoded genes. A total of seven genes (accD, rpoC2, psbK, rps12, ccsA, clpP and ycf2) were detected under positive selection in the cp genome. Alignment-free sequence comparison showed an extremely low sequence diversity across naturally occurring C. istanbulensis specimens. All six sequenced individuals shared the same cp haplotype. In summary, this study will aid further research on the molecular evolution and development of ex situ conservation strategies of C. istanbulensis.
Affiliation(s)
- Selahattin Baris Cay
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Yusuf Ulas Cinar
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Selim Can Kuralay
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Behcet Inal
- Department of Agricultural Biotechnology, Faculty of Agriculture, University of Siirt, Siirt, Turkey
| | - Gokmen Zararsiz
- Department of Biostatistics, Erciyes University, Kayseri, Turkey
- Drug Application and Research Center (ERFARMA), Erciyes University, Kayseri, Turkey
| | - Almila Ciftci
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Rachel Mollman
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Onur Obut
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Vahap Eldem
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| | - Yakup Bakir
- Department of Plant Bioactive Metabolites, ACTV Biotechnology, Inc., Istanbul, Turkey
| | - Osman Erol
- Department of Biology, Faculty of Sciences, Istanbul University, Istanbul, Turkey
| |
|
24
|
Li X, Wang X, Huang R, Stucky A, Chen X, Sun L, Wen Q, Zeng Y, Fletcher H, Wang C, Xu Y, Cao H, Sun F, Li SC, Zhang X, Zhong JF. The Machine-Learning-Mediated Interface of Microbiome and Genetic Risk Stratification in Neuroblastoma Reveals Molecular Pathways Related to Patient Survival. Cancers (Basel) 2022; 14:cancers14122874. [PMID: 35740540 PMCID: PMC9220810 DOI: 10.3390/cancers14122874] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/15/2022] [Revised: 05/23/2022] [Accepted: 05/30/2022] [Indexed: 02/01/2023] Open
Abstract
Simple Summary: Neuroblastoma is a highly heterogeneous malignancy with a wide range of outcomes from spontaneous regression to fatal chemoresistant disease, as currently treated according to the risk stratification of the Children’s Oncology Group (COG), resulting in some high COG risk patients receiving excessive treatment, due to lacking predictors for treatment response. Here, we sought to complement COG risk classification by using the tumor intracellular microbiome, which is part of the tumor’s molecular signature. We determine that an intra-tumor microbial gene abundance score, namely M-score, separates the high COG-risk patients into two subpopulations (Mhigh and Mlow) with higher accuracy in risk stratification than the current COG risk assessment, thus sparing a subset of high COG-risk patients from being subjected to traditional high-risk therapies. Abstract: Currently, most neuroblastoma patients are treated according to the Children’s Oncology Group (COG) risk group assignment; however, neuroblastoma’s heterogeneity renders only a few predictors for treatment response, resulting in excessive treatment. Here, we sought to couple COG risk classification with tumor intracellular microbiome, which is part of the molecular signature of a tumor. We determine that an intra-tumor microbial gene abundance score, namely M-score, separates the high COG-risk patients into two subpopulations (Mhigh and Mlow) with higher accuracy in risk stratification than the current COG risk assessment, thus sparing a subset of high COG-risk patients from being subjected to traditional high-risk therapies. Mechanistically, the classification power of M-scores implies the effect of CREB over-activation, which may influence the critical genes involved in cellular proliferation, anti-apoptosis, and angiogenesis, affecting tumor cell proliferation survival and metastasis. Thus, intracellular microbiota abundance in neuroblastoma regulates intracellular signals to affect patients’ survival.
Affiliation(s)
- Xin Li
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
| | - Xiaoqi Wang
- Medical Center of Hematology, Xinqiao Hospital, State Key Laboratory of Trauma, Burn and Combined Injury, Army Medical University, Chongqing 400037, China; (X.W.); (R.H.); (Q.W.); (Y.Z.)
| | - Ruihao Huang
- Medical Center of Hematology, Xinqiao Hospital, State Key Laboratory of Trauma, Burn and Combined Injury, Army Medical University, Chongqing 400037, China; (X.W.); (R.H.); (Q.W.); (Y.Z.)
| | - Andres Stucky
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
| | - Xuelian Chen
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
| | - Lan Sun
- Department of Oncology, Bishan Hospital of Chongqing Medical University, the People’s Hospital of Bishan District, Chongqing 400037, China;
| | - Qin Wen
- Medical Center of Hematology, Xinqiao Hospital, State Key Laboratory of Trauma, Burn and Combined Injury, Army Medical University, Chongqing 400037, China; (X.W.); (R.H.); (Q.W.); (Y.Z.)
| | - Yunjing Zeng
- Medical Center of Hematology, Xinqiao Hospital, State Key Laboratory of Trauma, Burn and Combined Injury, Army Medical University, Chongqing 400037, China; (X.W.); (R.H.); (Q.W.); (Y.Z.)
| | - Hansel Fletcher
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
| | - Charles Wang
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
| | - Yi Xu
- Divisions of Hematology and Oncology and Regenerative Medicine, Department of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (Y.X.); (H.C.)
- Cancer Center of Loma Linda University, Loma Linda, CA 92350, USA
| | - Huynh Cao
- Divisions of Hematology and Oncology and Regenerative Medicine, Department of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (Y.X.); (H.C.)
- Cancer Center of Loma Linda University, Loma Linda, CA 92350, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089, USA;
| | - Shengwen Calvin Li
- CHOC Children’s Research Institute, Children’s Hospital of Orange County (CHOC), 1201 La Veta Ave., Orange, CA 92868-3874, USA
- Department of Neurology, University of California—Irvine School of Medicine, 200 S. Manchester Ave. Ste. 206, Orange, CA 92868, USA
- Correspondence: (S.C.L.); (X.Z.); (J.F.Z.)
| | - Xi Zhang
- Medical Center of Hematology, Xinqiao Hospital, State Key Laboratory of Trauma, Burn and Combined Injury, Army Medical University, Chongqing 400037, China; (X.W.); (R.H.); (Q.W.); (Y.Z.)
- Correspondence: (S.C.L.); (X.Z.); (J.F.Z.)
| | - Jiang F. Zhong
- Department of Basic Science, School of Medicine, Loma Linda University, Loma Linda, CA 92350, USA; (X.L.); (A.S.); (X.C.); (H.F.); (C.W.)
- Cancer Center of Loma Linda University, Loma Linda, CA 92350, USA
- Correspondence: (S.C.L.); (X.Z.); (J.F.Z.)
| |
|
25
|
Javadzadeh S, Rajkumar U, Nguyen N, Sarmashghi S, Luebeck J, Shang J, Bafna V. FastViFi: Fast and accurate detection of (Hybrid) Viral DNA and RNA. NAR Genom Bioinform 2022; 4:lqac032. [PMID: 35493723 PMCID: PMC9041341 DOI: 10.1093/nargab/lqac032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2021] [Revised: 03/04/2022] [Accepted: 03/06/2022] [Indexed: 11/13/2022]
Abstract
DNA viruses are important infectious agents known to mediate a large number of human diseases, including cancer. Viral integration into the host genome and the formation of hybrid transcripts are also associated with increased pathogenicity. The high variability of viral genomes, however, requires the use of sensitive ensemble hidden Markov models that add to the computational complexity, often requiring > 40 CPU-hours per sample. Here, we describe FastViFi, a fast 2-stage filtering method that reduces the computational burden. On simulated and cancer genomic data, FastViFi improved the running time by 2 orders of magnitude with comparable accuracy on challenging data sets. Recently published methods have focused on identifying the location of viral integration into the human host genome using local assembly, but do not extend to RNA. To identify human-viral hybrid transcripts, we additionally developed ensemble hidden Markov models for the Epstein-Barr virus (EBV) to add to the models for the Hepatitis B (HBV) and Hepatitis C (HCV) viruses and the Human Papillomavirus (HPV), and used FastViFi to query RNA-seq data from gastric cancer (EBV) and liver cancer (HBV/HCV). FastViFi ran in <10 minutes per sample and identified multiple hybrids that fuse viral and human genes, suggesting new mechanisms for oncoviral pathogenicity. FastViFi is available at https://github.com/sara-javadzadeh/FastViFi.
Affiliation(s)
- Sara Javadzadeh
- Department of Computer Science & Engineering, UC San Diego, La Jolla, California, USA
| | - Utkrisht Rajkumar
- Department of Computer Science & Engineering, UC San Diego, La Jolla, California, USA
| | - Nam Nguyen
- Boundless Bio, Inc. 11099 N Torrey Pines Rd, La Jolla, CA, USA
| | - Shahab Sarmashghi
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, California, USA
| | - Jens Luebeck
- Bioinformatics & Systems Biology Graduate Program, UC San Diego, La Jolla, California, USA
| | - Jingbo Shang
- Department of Computer Science & Engineering, UC San Diego, La Jolla, California, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, UC San Diego, La Jolla, California, USA
- Boundless Bio, Inc. 11099 N Torrey Pines Rd, La Jolla, CA, USA
- Moores Cancer Center, UC San Diego, La Jolla, California, USA
| |
|
26
|
Paula DP, Barros SKA, Pitta RM, Barreto MR, Togawa RC, Andow DA. Metabarcoding versus mapping unassembled shotgun reads for identification of prey consumed by arthropod epigeal predators. Gigascience 2022; 11:6554098. [PMID: 35333301 PMCID: PMC8952265 DOI: 10.1093/gigascience/giac020] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/24/2021] [Revised: 12/07/2021] [Accepted: 02/09/2022] [Indexed: 12/19/2022]
Abstract
Background: A central challenge of DNA gut content analysis is to identify prey in a highly degraded DNA community. In this study, we evaluated prey detection using metabarcoding and a method of mapping unassembled shotgun reads (Lazaro). Results: In a mock prey community, metabarcoding did not detect any prey, probably owing to primer choice and/or preferential predator DNA amplification, while Lazaro detected prey with 43–71% accuracy. Gut content analysis of field-collected arthropod epigeal predators (3 ants, 1 dermapteran, and 1 carabid) from agricultural habitats in Brazil (27 samples, 46–273 individuals per sample) revealed that 64% of the prey species detections by either method were not confirmed by melting curve analysis and 87% of the true prey were detected in common. We hypothesized that Lazaro would detect fewer true- and false-positive and more false-negative prey with greater taxonomic resolution than metabarcoding but found that the methods were similar in sensitivity, specificity, false discovery rate, false omission rate, and accuracy. There was a positive correlation between the relative prey DNA concentration in the samples and the number of prey reads detected by Lazaro, while this was inconsistent for metabarcoding. Conclusions: Metabarcoding and Lazaro had similar, but partially complementary, detection of prey in arthropod predator guts. However, while Lazaro was almost 2× more expensive, the number of reads was related to the amount of prey DNA, suggesting that Lazaro may provide quantitative prey information while metabarcoding did not.
Affiliation(s)
- Débora Pires Paula
- Embrapa Genetic Resources and Biotechnology, Brasília-DF, 70770-917, Brazil
| | - David A Andow
- Department of Entomology, University of Minnesota, St. Paul, MN, 55108, USA
| |
|
27
|
Van Dam AR, Covas Orizondo JO, Lam AW, McKenna DD, Van Dam MH. Metagenomic clustering reveals microbial contamination as an essential consideration in ultraconserved element design for phylogenomics with insect museum specimens. Ecol Evol 2022; 12:e8625. [PMID: 35342556 PMCID: PMC8932080 DOI: 10.1002/ece3.8625] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2021] [Revised: 01/03/2022] [Accepted: 01/17/2022] [Indexed: 11/30/2022]
Abstract
Phylogenomics via ultraconserved elements (UCEs) has led to improved phylogenetic reconstructions across the tree of life. However, inadvertently incorporating non‐targeted DNA into the UCE marker design will lead to misinformation being incorporated into subsequent analyses. To date, the effectiveness of basic metagenomic filtering strategies has not been assessed in arthropods. Designing markers from museum specimens requires careful consideration of methods due to the high levels of microbial contamination typically found in such specimens. We investigate if contaminant sequences are carried forward into a UCE marker set we developed from insect museum specimens using a standard bioinformatics pipeline. We find that the methods currently employed by most researchers do not exclude contamination from the final set of targets. Lastly, we highlight several paths forward for reducing contamination in UCE marker design.
Affiliation(s)
- Alex R. Van Dam
- Department of Biology University of Puerto Rico Mayagüez Mayagüez Puerto Rico
| | - Athena W. Lam
- Department of Entomology California Academy of Sciences San Francisco California USA
| | - Duane D. McKenna
- Department of Biological Sciences University of Memphis Memphis Tennessee USA
- Center for Biodiversity Research University of Memphis Memphis Tennessee USA
| | - Matthew H. Van Dam
- Department of Entomology California Academy of Sciences San Francisco California USA
| |
|
28
|
Blanca A, Harris RS, Koslicki D, Medvedev P. The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. J Comput Biol 2022; 29:155-168. [PMID: 35108101 DOI: 10.1089/cmb.2021.0431] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 01/22/2023]
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
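The mutation model in this abstract can be illustrated with a minimal sketch. It relies only on the stated assumptions (each nucleotide mutates independently with probability r, no spurious k-mer matches), under which a k-mer is mutated whenever at least one of its k positions is, so the expected number of mutated k-mers follows by linearity of expectation. The function names are hypothetical, and the inversion shown is a simple point estimate of r, not the hypothesis tests or confidence intervals derived in the paper.

```python
def kmer_mutation_probability(r: float, k: int) -> float:
    """Probability that a k-mer contains at least one mutated nucleotide,
    assuming each nucleotide mutates independently with probability r."""
    return 1.0 - (1.0 - r) ** k


def expected_mutated_kmers(num_kmers: int, r: float, k: int) -> float:
    """Expected number of mutated k-mers among num_kmers k-mers.
    Linearity of expectation holds even though overlapping k-mers are dependent."""
    return num_kmers * kmer_mutation_probability(r, k)


def estimate_mutation_rate(num_mutated: float, num_kmers: int, k: int) -> float:
    """Illustrative point estimate of r obtained by inverting the expectation.
    This is an assumption-labelled sketch, not the paper's interval estimator."""
    fraction = num_mutated / num_kmers
    return 1.0 - (1.0 - fraction) ** (1.0 / k)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    expected = expected_mutated_kmers(num_kmers=10_000, r=0.01, k=21)
    print(f"expected mutated 21-mers: {expected:.1f}")
    print(f"recovered r: {estimate_mutation_rate(expected, 10_000, 21):.4f}")
```

Because overlapping k-mers share nucleotides, the variance of the count is not that of independent trials; the sketch deliberately stops at the expectation.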
Affiliation(s)
- Antonio Blanca
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Robert S Harris
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA; Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA; Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| |
|
29
|
Zhu Q, Mirarab S. Assembling a Reference Phylogenomic Tree of Bacteria and Archaea by Summarizing Many Gene Phylogenies. Methods Mol Biol 2022; 2569:137-165. [PMID: 36083447 DOI: 10.1007/978-1-0716-2691-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/24/2023]
Abstract
Phylogenomics is the inference of phylogenetic trees based on multiple marker genes sampled in the genomes of interest. An important challenge in phylogenomics is the potential incongruence among the evolutionary histories of individual genes, which can be widespread in microorganisms due to the prevalence of horizontal gene transfer. This protocol introduces the procedures for building a phylogenetic tree of a large number of microbial genomes using a broad sampling of marker genes that are representative of whole-genome evolution. The protocol highlights the use of a gene tree summary method, which can effectively reconstruct the species tree while accounting for the topological conflicts among individual gene trees. The pipeline described in this protocol is scalable to tens of thousands of genomes while retaining high accuracy. We discussed multiple software tools, libraries, and scripts to enable convenient adoption of the protocol. The protocol is suitable for microbiology and microbiome studies based on public genomes and metagenomic data.
Affiliation(s)
- Qiyun Zhu
- Biodesign Center for Fundamental and Applied Microbiomics, Arizona State University, Tempe, AZ, USA.
- School of Life Sciences, Arizona State University, Tempe, AZ, USA.
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA, USA
| |
|
30
|
Sarmashghi S, Balaban M, Rachtman E, Touri B, Mirarab S, Bafna V. Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. PLoS Comput Biol 2021; 17:e1009449. [PMID: 34780468 PMCID: PMC8629397 DOI: 10.1371/journal.pcbi.1009449] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/09/2021] [Revised: 11/29/2021] [Accepted: 09/13/2021] [Indexed: 01/26/2023]
Abstract
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome skims) could be transformative for genomic ecology. Analyzing genome skims, mostly based on statistics of small oligomers, remains challenging, but recent results have shown the advantage of this approach for the identification and phylogenetic placement of eukaryotic species. In this paper, we present a method, RESPECT, to estimate genomic properties such as genome length and repetitiveness from low-coverage genome skims. We trained RESPECT using assembled genomes and tested it on low-coverage simulated and real reads. Benchmarking results reveal that RESPECT has excellent accuracy in estimating the genome length compared to other methods, and can provide critical information regarding the repeat structure of the genome.
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Metin Balaban
- Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
| | - Eleonora Rachtman
- Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, California, United States of America
| | - Behrouz Touri
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, California, United States of America
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California, United States of America
| |
|
31
|
Costa L, Marques A, Buddenhagen C, Thomas WW, Huettel B, Schubert V, Dodsworth S, Houben A, Souza G, Pedrosa-Harand A. Aiming off the target: recycling target capture sequencing reads for investigating repetitive DNA. Ann Bot 2021; 128:835-848. [PMID: 34050647 PMCID: PMC8577205 DOI: 10.1093/aob/mcab063] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/02/2021] [Accepted: 05/26/2021] [Indexed: 05/28/2023]
Abstract
BACKGROUND AND AIMS: With the advance of high-throughput sequencing, reduced-representation methods such as target capture sequencing (TCS) emerged as cost-efficient ways of gathering genomic information, particularly from coding regions. As the off-target reads from such sequencing are expected to be similar to genome skimming (GS), we assessed the quality of repeat characterization in plant genomes using these data. METHODS: Repeat composition obtained from TCS datasets of five Rhynchospora (Cyperaceae) species were compared with GS data from the same taxa. In addition, a FISH probe was designed based on the most abundant satellite found in the TCS dataset of Rhynchospora cephalotes. Finally, repeat-based phylogenies of the five Rhynchospora species were constructed based on the GS and TCS datasets and the topologies were compared with a gene-alignment-based phylogenetic tree. KEY RESULTS: All the major repetitive DNA families were identified in TCS, including repeats that showed abundances as low as 0.01% in the GS data. Rank correlations between GS and TCS repeat abundances were moderately high (r = 0.58-0.85), increasing after filtering out the targeted loci from the raw TCS reads (r = 0.66-0.92). Repeat data obtained by TCS were also reliable in developing a cytogenetic probe of a new variant of the holocentromeric satellite Tyba. Repeat-based phylogenies from TCS data were congruent with those obtained from GS data and the gene-alignment tree. CONCLUSIONS: Our results show that off-target TCS reads can be recycled to identify repeats for cyto- and phylogenomic investigations. Given the growing availability of TCS reads, driven by global phylogenomic projects, our strategy represents a way to recycle genomic data and contribute to a better characterization of plant biodiversity.
Affiliation(s)
- Lucas Costa
- Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, Recife-PE, Brazil
| | - André Marques
- Max Planck Institute for Plant Breeding Research, Cologne, Germany
| | - Bruno Huettel
- Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
| | - Veit Schubert
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
| | - Andreas Houben
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
| | - Gustavo Souza
- Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, Recife-PE, Brazil
| | - Andrea Pedrosa-Harand
- Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, Recife-PE, Brazil
| |
|
32
|
Blanke M, Morgenstern B. App-SpaM: phylogenetic placement of short reads without sequence alignment. Bioinform Adv 2021; 1:vbab027. [PMID: 36700102 PMCID: PMC9710606 DOI: 10.1093/bioadv/vbab027] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/31/2021] [Revised: 09/27/2021] [Accepted: 10/11/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Phylogenetic placement is the task of placing a query sequence of unknown taxonomic origin into a given phylogenetic tree of a set of reference sequences. A major field of application of such methods is, for example, the taxonomic identification of reads in metabarcoding or metagenomic studies. Several approaches to phylogenetic placement have been proposed in recent years. The most accurate of them requires a multiple sequence alignment of the references as input. However, calculating multiple alignments is not only time-consuming but also limits the applicability of these approaches. Results: Herein, we propose Alignment-free phylogenetic placement algorithm based on Spaced-word Matches (App-SpaM), an efficient algorithm for the phylogenetic placement of short sequencing reads on a tree of a set of reference sequences. App-SpaM produces results of high quality that are on a par with the best available approaches to phylogenetic placement, while our software is two orders of magnitude faster than these existing methods. Our approach neither requires a multiple alignment of the reference sequences nor alignments of the queries to the references. This enables App-SpaM to perform phylogenetic placement on a broad variety of datasets. Availability and implementation: The source code of App-SpaM is freely available on GitHub at https://github.com/matthiasblanke/App-SpaM together with detailed instructions for installation and settings. App-SpaM is furthermore available as a Conda package on the Bioconda channel. Contact: matthias.blanke@biologie.uni-goettingen.de. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Matthias Blanke
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany; International Max Planck Research School for Genome Science, Göttingen 37077, Germany. To whom correspondence should be addressed.
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany; Campus-Institute Data Science (CIDAS), Göttingen 37077, Germany
| |
|
33
|
Oliveira MAS, Nunes T, Dos Santos MA, Ferreira Gomes D, Costa I, Van-Lume B, Marques Da Silva SS, Oliveira RS, Simon MF, Lima GSA, Gissi DS, Almeida CCDS, Souza G, Marques A. High-Throughput Genomic Data Reveal Complex Phylogenetic Relationships in Stylosanthes Sw (Leguminosae). Front Genet 2021; 12:727314. [PMID: 34630521 PMCID: PMC8495327 DOI: 10.3389/fgene.2021.727314] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/18/2021] [Accepted: 09/08/2021] [Indexed: 11/22/2022]
Abstract
Allopolyploidy is widely present across plant lineages, though estimating the correct phylogenetic relationships and origin of allopolyploids can be a hard task. In the genus Stylosanthes Sw. (Leguminosae), an important legume crop, allopolyploidy is a key speciation force. This makes adequate species recognition and breeding efforts in the genus difficult. Based on comparative analysis of nine high-throughput sequencing (HTS) samples, including three allopolyploids (S. capitata Vogel cv. “Campo Grande,” S. capitata “RS024” and S. scabra Vogel) and six diploids (S. hamata Taub, S. viscosa (L.) Sw., S. macrocephala M. B. Ferreira and Sousa Costa, S. guianensis (Aubl.) Sw., S. pilosa M. B. Ferreira and Sousa Costa and S. seabrana B. L. Maass & 't Mannetje), we provide a working pipeline to identify organelle and nuclear genome signatures that allowed us to trace the origin and parental genome recognition of allopolyploids. First, organelle genomes were de novo assembled and used to identify maternal genome donors by alignment-based phylogenies and synteny analysis. Second, nuclear-derived reads were subjected to repetitive DNA identification with RepeatExplorer2. Identified repeats were compared based on abundance and presence in diploids in relation to allopolyploids by comparative repeat analysis. Third, reads were extracted and grouped into the following groups: chloroplast, mitochondrial, satellite DNA, ribosomal DNA, repeat-clustered, and total genomic reads. These sets of reads were then subjected to alignment- and assembly-free phylogenetic analyses and were compared to classical alignment-based phylogenetic methods. Comparative analysis of shared and unique satellite repeats also allowed the tracing of allopolyploid origin in Stylosanthes, especially those with high abundance such as the StyloSat1 in the Scabra complex. This satellite was in situ mapped in the proximal region of the chromosomes and made it possible to identify its previously proposed parents. Hence, with simple genome skimming data we were able to provide evidence for the recognition of parental genomes and understand genome evolution of two Stylosanthes allopolyploids.
Affiliation(s)
| | - Tomáz Nunes
- Laboratory of Genetic Resources, Federal University of Alagoas, Arapiraca, Brazil
| | - Iara Costa
- Laboratory of Genetic Resources, Federal University of Alagoas, Arapiraca, Brazil
| | - Brena Van-Lume
- Laboratory of Plant Cytogenetics and Evolution, Federal University of Pernambuco, Recife, Brazil
| | - Ronaldo Simão Oliveira
- Campus Xique Xique, Federal Institute of Education, Science and Technology of Bahia, Xique-Xique, Brazil
| | - Gaus S A Lima
- Center of Agronomic Sciences, Federal University of Alagoas, Rio Largo, Brazil
| | - Danilo Soares Gissi
- Department of Biostatistics, Institute of Biosciences-IBB, Plant Biology, Parasitology and Zoology, São Paulo State University-UNESP, Botucatu, Brazil
| | - Gustavo Souza
- Laboratory of Plant Cytogenetics and Evolution, Federal University of Pernambuco, Recife, Brazil
| | - André Marques
- Laboratory of Genetic Resources, Federal University of Alagoas, Arapiraca, Brazil; Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Cologne, Germany
| |
|
34
|
Chafin TK, Regmi B, Douglas MR, Edds DR, Wangchuk K, Dorji S, Norbu P, Norbu S, Changlu C, Khanal GP, Tshering S, Douglas ME. Parallel introgression, not recurrent emergence, explains apparent elevational ecotypes of polyploid Himalayan snowtrout. R Soc Open Sci 2021; 8:210727. [PMID: 34729207 PMCID: PMC8548808 DOI: 10.1098/rsos.210727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2021] [Accepted: 10/01/2021] [Indexed: 06/13/2023]
Abstract
The recurrence of similar evolutionary patterns within different habitats often reflects parallel selective pressures acting upon either standing or independently occurring genetic variation to produce a convergence of phenotypes. This interpretation (i.e. parallel divergences within adjacent streams) has been hypothesized for drainage-specific morphological 'ecotypes' observed in polyploid snowtrout (Cyprinidae: Schizothorax). However, parallel patterns of differential introgression during secondary contact are a viable alternative hypothesis. Here, we used ddRADseq (N = 35 319 de novo and N = 10 884 transcriptome-aligned SNPs), as derived from Nepali/Bhutanese samples (N = 48 each), to test these competing hypotheses. We first employed genome-wide allelic depths to derive appropriate ploidy models, then a Bayesian approach to yield genotypes statistically consistent under the inferred expectations. Elevational 'ecotypes' were consistent in geometric morphometric space, but with phylogenetic relationships at the drainage level, sustaining a hypothesis of independent emergence. However, partitioned analyses of phylogeny and admixture identified subsets of loci under selection that retained genealogical concordance with morphology, suggesting instead that apparent patterns of morphological/phylogenetic discordance are driven by widespread genomic homogenization. Here, admixture occurring in secondary contact effectively 'masks' previous isolation. Our results underscore two salient factors: (i) morphological adaptations are retained despite hybridization and (ii) the degree of admixture varies across tributaries, presumably concomitant with underlying environmental or anthropogenic factors.
Affiliation(s)
- Tyler K. Chafin
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder 80309, USA
| | - Binod Regmi
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- National Institute of Arthritis, Musculoskeletal and Skin Diseases (NIAMS), National Institutes of Health, Bethesda, MD 20892, USA
| | - Marlis R. Douglas
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
| | - David R. Edds
- Department of Biological Sciences, Emporia State University, Emporia, KS 66801, USA
| | - Karma Wangchuk
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Sonam Dorji
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Pema Norbu
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Sangay Norbu
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Changlu Changlu
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Gopal Prasad Khanal
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Singye Tshering
- National Research and Development Centre for Riverine and Lake Fisheries, Ministry of Agriculture and Forests, Royal Government of Bhutan, Haa, Bhutan
| | - Michael E. Douglas
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
| |
|
35
|
Agrawal N, Gupta M, Atri C, Akhatar J, Kumar S, Heslop-Harrison PJS, Banga SS. Anchoring alien chromosome segment substitutions bearing gene(s) for resistance to mustard aphid in Brassica juncea-B. fruticulosa introgression lines and their possible disruption through gamma irradiation. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2021; 134:3209-3224. [PMID: 34160642 DOI: 10.1007/s00122-021-03886-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/08/2020] [Accepted: 06/08/2021] [Indexed: 05/18/2023]
Abstract
KEY MESSAGE: Heavy doses of gamma irradiation can reduce linkage drag by disrupting large-sized alien translocations and promoting exchanges between crop and wild genomes. Resistance to mustard aphid (Lipaphis erysimi) infestation was significantly improved in Brassica juncea through B. juncea-B. fruticulosa introgression. However, linkage drag caused by introgressed chromatin fragments has so far prevented the deployment of this resistance source in commercial cultivars. We investigated the patterns of donor chromatin segment substitutions in the introgression lines (ILs) through genomic in situ hybridization (GISH) coupled with B. juncea chromosome-specific oligonucleotide probes. These allowed identification of large chromosome translocations from B. fruticulosa in the terminal regions of chromosomes A05, B02, B03 and B04 in three founder ILs (AD-64, 101 and 104). Only AD-101 carried an additional translocation at the sub-terminal to intercalary position in both homologues of chromosome A01. We validated these translocations with a reciprocal BLAST hit analysis using shotgun sequencing of three ILs and species-specific contigs/scaffolds (kb-sized) from a de novo assembly of B. fruticulosa. Alien segment substitution on chromosome A05 could not be validated. Current studies also endeavoured to break linkage drag by exposing seeds to a heavy dose (200 kR) of gamma radiation. Reduction in the size of introgressed chromatin fragments was observed in many M3 plants. There was a complete loss of the alien chromosome fragment in one instance. A few M3 plants with novel patterns of chromosome segment substitutions displayed improved agronomic performance coupled with resistance to mustard aphid. SNPs in such genomic spaces should aid the development of markers to track introgressed DNA and allow application in plant breeding.
Affiliation(s)
- Neha Agrawal
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India
| | - Mehak Gupta
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India
| | - Chhaya Atri
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India
| | - Javed Akhatar
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India
| | - Sarwan Kumar
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India
| | | | - Surinder S Banga
- Department of Plant Breeding and Genetics, Punjab Agricultural University, Ludhiana, Punjab, 141004, India.
| |
|
36
|
Rachtman E, Bafna V, Mirarab S. CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genom Bioinform 2021; 3:lqab071. [PMID: 34377979 PMCID: PMC8340999 DOI: 10.1093/nargab/lqab071] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 03/10/2021] [Revised: 06/30/2021] [Accepted: 07/19/2021] [Indexed: 12/27/2022] Open
Abstract
A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advantage of the large-memory machines available nowadays, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies.
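As an illustration of the general idea of testing whether a query k-mer lies within a given Hamming distance of a reference set, the following is a minimal sketch of bit-sampling locality-sensitive hashing. It is a generic textbook scheme, not CONSULT's implementation; the class name `HammingLSH` and all parameter names are hypothetical, and the parameter values in the usage note are arbitrary.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class HammingLSH:
    """Bit-sampling LSH: each table keys a k-mer by a random subset of its positions."""

    def __init__(self, k, positions_per_table, n_tables, seed=0):
        rng = random.Random(seed)
        # One random position subset (mask) per hash table.
        self.masks = [sorted(rng.sample(range(k), positions_per_table))
                      for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    @staticmethod
    def _key(kmer, mask):
        return "".join(kmer[i] for i in mask)

    def add(self, kmer):
        """Insert a reference k-mer into every table."""
        for mask, table in zip(self.masks, self.tables):
            table[self._key(kmer, mask)].append(kmer)

    def has_neighbor(self, kmer, max_dist):
        """True if some colliding reference k-mer is within max_dist mismatches.

        Candidates are verified exactly, so false positives cannot occur;
        a true neighbor can still be missed if it collides in no table.
        """
        for mask, table in zip(self.masks, self.tables):
            for candidate in table.get(self._key(kmer, mask), ()):
                if hamming(kmer, candidate) <= max_dist:
                    return True
        return False
```

For example, after `lsh = HammingLSH(k=8, positions_per_table=3, n_tables=2)` and `lsh.add("ACGTACGT")`, an exact copy of the reference k-mer always collides in every table. Using more tables raises the chance that a near neighbor is found, at the cost of memory.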
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| |
|
37
|
Lu YY, Bai J, Wang Y, Wang Y, Sun F. CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase. Bioinformatics 2021; 37:155-161. [PMID: 32766810 DOI: 10.1093/bioinformatics/btaa699] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/18/2019] [Revised: 03/11/2020] [Accepted: 07/28/2020] [Indexed: 01/02/2023] Open
Abstract
MOTIVATION: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. RESULTS: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2- to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. AVAILABILITY AND IMPLEMENTATION: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Young Lu
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Jiaxing Bai
- Department of Automation, Xiamen University, Xiamen 361000, China
| | - Yiwen Wang
- Department of Automation, Xiamen University, Xiamen 361000, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen 361000, China.,Xiamen Key Lab. of Big Data Intelligent Analysis and Decision, Xiamen 361000, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| |
|
38
|
Sequence Comparison Without Alignment: The SpaM Approaches. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
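To make the notion of a spaced-word match concrete, here is a minimal sketch, assuming a binary pattern in which '1' marks a position that must match and '0' marks a position where a mismatch is allowed. It only enumerates matches and does not reproduce the distance estimation of the SpaM programs; the function names are hypothetical.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Yield (position, word) pairs for every window of len(pattern).

    The word keeps only the characters at the match positions ('1')
    of the binary pattern; don't-care positions ('0') are skipped.
    """
    match_idx = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield start, "".join(seq[start + i] for i in match_idx)


def spaced_word_matches(seq_a, seq_b, pattern):
    """Return all (pos_a, pos_b) pairs whose spaced words are identical."""
    index = defaultdict(list)
    for pos, word in spaced_words(seq_a, pattern):
        index[word].append(pos)
    return [(pos_a, pos_b)
            for pos_b, word in spaced_words(seq_b, pattern)
            for pos_a in index.get(word, [])]
```

With the pattern "1101", the windows ACGT and ACTT form a spaced-word match because they differ only at the don't-care position, whereas the contiguous pattern "1111" would reject them.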
|
39
|
Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform 2021; 3:lqab001. [PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/14/2019] [Revised: 12/07/2020] [Accepted: 01/08/2021] [Indexed: 11/12/2022] Open
Abstract
Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms take quadratic time and are therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is the identity score (the percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need to visualize how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2-80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
Affiliation(s)
- Hani Z Girgis
- Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA
| | - Benjamin T James
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA
| | - Brian B Luczak
- Department of Mathematics, Vanderbilt University, 1326 Stevenson Center Lane, Nashville, TN 3721, USA
| |
|
40
|
Criscuolo A. On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference. F1000Res 2020; 9:1309. [PMID: 33335719 PMCID: PMC7713896 DOI: 10.12688/f1000research.26930.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 10/12/2020] [Indexed: 12/29/2022] Open
Abstract
Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice to approximating uncorrected distances between genomes, and transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.
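As background for the MinHash-based distances discussed above, the sketch below estimates the Jaccard index of two k-mer sets from bottom-s sketches and converts it with the Mash distance formula, D = -(1/k) ln(2J / (1 + J)). This is the standard Mash transformation, shown only as an example of a simple formula-based conversion; it is not claimed to be the transformation proposed in the article, and the function names and the choice of SHA-1 as the hash are assumptions made for illustration.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer):
    # 64-bit integer taken from a SHA-1 digest (an arbitrary illustrative choice).
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest hash values of the k-mer set."""
    return sorted(_hash(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from the `size` smallest hashes of the union."""
    union = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2J / (1 + J)); infinite when J = 0."""
    if jaccard <= 0:
        return float("inf")
    return -math.log(2 * jaccard / (1 + jaccard)) / k
```

When the sketch size exceeds the number of distinct k-mers, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for memory.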
Affiliation(s)
- Alexis Criscuolo
- Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France
| |
|
41
|
Klötzl F, Haubold B. Phylonium: fast estimation of evolutionary distances from large samples of similar genomes. Bioinformatics 2020; 36:2040-2046. [PMID: 31790149 PMCID: PMC7141870 DOI: 10.1093/bioinformatics/btz903] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/06/2019] [Revised: 11/01/2019] [Accepted: 11/28/2019] [Indexed: 11/13/2022] Open
Abstract
Motivation: Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence. Results: We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium. Availability and implementation: Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fabian Klötzl
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| | - Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
| |
|
42
|
Baharav TZ, Kamath GM, Tse DN, Shomorony I. Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments. PATTERNS 2020; 1:100081. [PMID: 33205128 PMCID: PMC7660437 DOI: 10.1016/j.patter.2020.100081] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/26/2020] [Revised: 06/09/2020] [Accepted: 07/03/2020] [Indexed: 01/02/2023]
Abstract
Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
Affiliation(s)
- Tavor Z Baharav
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | | | - David N Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Ilan Shomorony
- Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL 61801, USA
| |
|
43
|
Wang Y, Chen Q, Deng C, Zheng Y, Sun F. KmerGO: A Tool to Identify Group-Specific Sequences With k-mers. Front Microbiol 2020; 11:2067. [PMID: 32983048 PMCID: PMC7477287 DOI: 10.3389/fmicb.2020.02067] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/07/2020] [Accepted: 08/06/2020] [Indexed: 01/24/2023] Open
Abstract
Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species or other elements associated with each group. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered a “group-specific” sequence in our study. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower requirements for computing resources and in a much shorter running time. For a 1.05 TB dataset (.fasta), it takes KmerGO about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. Through multi-process parallel computing, KmerGO is implemented with both a graphical user interface and a command line interface on Linux and Windows, free from any pre-installed supporting environments, packages, and complex configurations. The output group-specific k-mers or sequences from KmerGO could be the inputs of other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.
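The definition of a group-specific sequence given above (present or rich in one group, absent or scarce in the other) can be illustrated with a deliberately simplified presence/absence filter. This is only a sketch of the concept, not KmerGO's algorithm, which works on k-mer counts from large datasets; the function names and the two threshold parameters are hypothetical.

```python
from collections import Counter


def sample_kmers(seq, k):
    """Set of distinct k-mers in one sample (here a single sequence string)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_specific_kmers(group_a, group_b, k, min_frac_a=1.0, max_frac_b=0.0):
    """k-mers found in at least min_frac_a of the samples in group A
    and in at most max_frac_b of the samples in group B."""
    # Each sample contributes a k-mer at most once, so counts are sample counts.
    count_a = Counter(km for s in group_a for km in sample_kmers(s, k))
    count_b = Counter(km for s in group_b for km in sample_kmers(s, k))
    return {km for km, c in count_a.items()
            if c / len(group_a) >= min_frac_a
            and count_b[km] / len(group_b) <= max_frac_b}
```

Relaxing `min_frac_a` or raising `max_frac_b` moves the filter from "present versus absent" toward "rich versus scarce"; a statistical test on counts would be the natural next step for real data.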
Affiliation(s)
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision-Making, Xiamen, China
| | - Qi Chen
- Department of Automation, Xiamen University, Xiamen, China
| | - Chao Deng
- Department of Automation, Xiamen University, Xiamen, China
| | - Yiluan Zheng
- Department of Automation, Xiamen University, Xiamen, China
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
| |
|
44
|
Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Mol Ecol 2020; 29:2521-2534. [PMID: 32542933 PMCID: PMC7496323 DOI: 10.1111/mec.15507] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 02/26/2020] [Revised: 06/03/2020] [Accepted: 06/05/2020] [Indexed: 02/06/2023]
Abstract
Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca. 658 bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and led, among other things, to the instigation of the Barcode of Life Database (BOLD), which currently contains barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR-based generation of DNA references into shotgun sequencing-based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA-mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA-marks," it will enable future DNA-based taxonomic identification to complement, or even replace, PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
Affiliation(s)
- Kristine Bohmann
- Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - M. Thomas P. Gilbert
- Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- NTNU University Museum, Trondheim, Norway
| |
|
45
|
Abstract
MOTIVATION Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. RESULTS We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. AVAILABILITY AND IMPLEMENTATION The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Department, University of California San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California San Diego, San Diego, CA 92093, USA
| |
|
46
|
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
| | - Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
| | - Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
| | | | | | - Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
| |
|
47
|
Balaban M, Sarmashghi S, Mirarab S. APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments. Syst Biol 2020; 69:566-578. [PMID: 31545363 PMCID: PMC7164367 DOI: 10.1093/sysbio/syz063] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 02/26/2019] [Revised: 09/05/2019] [Accepted: 09/10/2019] [Indexed: 11/14/2022] Open
Abstract
Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy the benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is available publicly at github.com/balabanmetin/apples.
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
| | - Shahab Sarmashghi
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
| |
|
48
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
| | - Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
| | - Christoph Bleidorn
- Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
| | - Sagi Snir
- Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
| |
|
49
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number N_k of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on N_k and that is affine-linear within a certain range of k. Integers k_min and k_max can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k_min) and F(k_max). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
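The slope idea in this abstract can be illustrated with a small Python sketch. It is not the Slope-SpaM implementation: the form of F used here (the logarithm of the match count minus an expected random background, assuming uniform base frequencies of 1/4) is an assumption made for illustration, the function names are hypothetical, and k_min and k_max are passed in by the caller rather than derived from the sequence lengths as the abstract describes. Only contiguous words are handled, not spaced words.

```python
import math
from collections import Counter


def count_kmer_matches(a: str, b: str, k: int) -> int:
    """Count position pairs (i, j) with a[i:i+k] == b[j:j+k]."""
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[j:j + k] for j in range(len(b) - k + 1))
    return sum(n * words_b[w] for w, n in words_a.items())


def f_value(a: str, b: str, k: int) -> float:
    """Assumed form of F: log of the match count above random background.

    The background term assumes each base occurs with probability 1/4.
    """
    background = (len(a) - k + 1) * (len(b) - k + 1) / 4 ** k
    excess = count_kmer_matches(a, b, k) - background
    if excess <= 0:
        raise ValueError("no match signal above background at this k")
    return math.log(excess)


def match_probability(a: str, b: str, k_min: int, k_max: int) -> float:
    """Estimate the per-site match probability from the slope of F."""
    slope = (f_value(a, b, k_max) - f_value(a, b, k_min)) / (k_max - k_min)
    return math.exp(slope)


def jukes_cantor_distance(p_match: float) -> float:
    """Convert a match probability into substitutions per site (Jukes-Cantor)."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * (1.0 - p_match))
```

Under these assumptions, a distance estimate would be obtained as `jukes_cantor_distance(match_probability(seq1, seq2, k_min, k_max))`. The conversion is only defined for match probabilities above 0.25, and the estimate is noisy for short sequences.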
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
| |
|
50
|
Garrido-Sanz L, Senar MÀ, Piñol J. Estimation of the relative abundance of species in artificial mixtures of insects using low-coverage shotgun metagenomics. Metabarcoding and Metagenomics 2020. [DOI: 10.3897/mbmg.4.48281] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022] Open
Abstract
Amplicon metabarcoding is an established technique to analyse the taxonomic composition of communities of organisms using high-throughput DNA sequencing, but there are doubts about its ability to quantify the relative proportions of the species, as opposed to the species list. Here, we bypass the enrichment step and avoid PCR bias by directly sequencing the extracted DNA using shotgun metagenomics. This approach is common practice in prokaryotes, but not in eukaryotes, because of the low number of sequenced genomes of eukaryotic species. We tested the metagenomics approach using insect species whose genomes are already sequenced and assembled to an advanced degree. We shotgun-sequenced, at low coverage, 18 species of insects in 22 single-species and 6 mixed-species libraries and mapped the reads against 110 reference genomes of insects. We used the single-species libraries to calibrate the assignment of reads to species and the libraries created from species mixtures to evaluate the ability of the method to quantify the relative species abundance. Our results showed that the shotgun metagenomic method can easily distinguish closely related insect species, such as the four species of Drosophila included in the artificial libraries. However, to avoid counting rare misclassified reads in samples, it was necessary to use a rather stringent detection limit of 0.001, so species with a lower relative abundance are ignored. We also found that approximately half the raw reads were informative for taxonomic purposes. Finally, using the mixed-species libraries, we showed that it was feasible to quantify with confidence the relative abundance of individual species in the mixtures.
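The detection-limit step described in this abstract can be sketched in a few lines of Python. This is an illustrative reading, not the authors' pipeline: the function name, the input format (a mapping from species to assigned read counts), and the choice not to renormalise the surviving fractions are all assumptions.

```python
def relative_abundance(read_counts, detection_limit=0.001):
    """Return per-species read fractions, dropping species below the limit.

    read_counts: dict mapping species name -> number of assigned reads.
    Fractions are relative to all assigned reads and are NOT renormalised
    after filtering (an assumption of this sketch).
    """
    total = sum(read_counts.values())
    if total == 0:
        return {}
    fractions = {species: n / total for species, n in read_counts.items()}
    return {s: f for s, f in fractions.items() if f >= detection_limit}


# Hypothetical counts: species_C falls below 0.001 and is ignored.
example = relative_abundance(
    {"species_A": 6000, "species_B": 3995, "species_C": 5}
)
```

With these made-up counts, `species_C` has a fraction of 0.0005 and is excluded, which mirrors how a stringent limit suppresses rare misclassified reads at the cost of missing truly rare species.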
|