1
Liang C, Wang W, Chen J. First transcriptome assembly of a new ciliate species (Protocruzia marianaensis) isolated from the Mariana Trench area. Mar Genomics 2025; 79:101164. [PMID: 39855811 DOI: 10.1016/j.margen.2024.101164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Revised: 11/28/2024] [Accepted: 12/03/2024] [Indexed: 01/27/2025]
Abstract
This is the first report of a transcriptome assembly for a newly discovered Protocruzia species collected from the under-sampled area near the Mariana Trench. We sequenced the transcriptome of P. marianaensis using the Illumina NovaSeq 6000 platform. De novo assembly and analysis of the coding regions predicted 36,116 unigenes, 74.91% of which were annotated against public databases. The transcriptome of P. marianaensis will be a valuable resource for studying the ecological and biological characteristics of this new species, which is the first Protocruzia species reported from the deep sea. These data can also help to understand protozoan survival mechanisms in deep-sea habitats and provide essential biological material for investigating unique life phenomena and processes in the deep ocean.
Affiliation(s)
- Chen Liang
- Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, College of Geography and Oceanography, Minjiang University, Fuzhou, 350108, China; Technology Innovation Center for Monitoring and Restoration Engineering of Ecological Fragile Zone in Southeast China, Ministry of Natural Resources, Fuzhou 350001, China; Key Laboratory of Marine Ecosystem Dynamics, Second Institute of Oceanography, Ministry of Natural Resources, Hangzhou 310012, China.
- Wei Wang
- Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, College of Geography and Oceanography, Minjiang University, Fuzhou, 350108, China
- Jianming Chen
- Fujian Key Laboratory on Conservation and Sustainable Utilization of Marine Biodiversity, Fuzhou Institute of Oceanography, College of Geography and Oceanography, Minjiang University, Fuzhou, 350108, China
2
Sommer M, Zimin A, Salzberg S. PSAURON: a tool for assessing protein annotation across a broad range of species. NAR Genom Bioinform 2025; 7:lqae189. [PMID: 39781514 PMCID: PMC11704789 DOI: 10.1093/nargab/lqae189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Revised: 12/10/2024] [Accepted: 12/23/2024] [Indexed: 01/12/2025] Open
Abstract
Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript, we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON's effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a widely applicable method to evaluate precision in gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON.
Affiliation(s)
- Markus J Sommer
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Aleksey V Zimin
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Steven L Salzberg
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21218, USA
3
Khan ZA, Sharma SK, Gupta N, Diksha D, Thapa P, Shimray MY, Prajapati MR, Nabi SU, Watpade S, Verma MK, Baranwal VK. Assessing the de novo assemblers: a metaviromic study of apple and first report of citrus concave gum-associated virus, apple rubbery wood virus 1 and 2 infecting apple in India. BMC Genomics 2024; 25:1057. [PMID: 39516740 PMCID: PMC11546112 DOI: 10.1186/s12864-024-10968-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024] Open
Abstract
BACKGROUND The choice of de novo assembler for high-throughput sequencing (HTS) data remains a pivotal factor in the HTS-based discovery of viral pathogens. This study assessed three de novo assemblers, namely Trinity, SPAdes, and MEGAHIT, on HTS datasets generated on the Illumina platform from 23 apple samples, representing 15 exotic and indigenous apple varieties and a rootstock. The assemblers were compared based on assembly quality metrics, including the largest contig, total assembly length, genome coverage, and N50. RESULTS MEGAHIT was the most efficient assembler according to the metrics evaluated in this study. By using multiple assemblers, near-complete genome sequences of citrus concave gum-associated virus (CCGaV), apple rubbery wood virus 1 (ARWV-1), ARWV-2, apple necrotic mosaic virus (ApNMV), apple mosaic virus, apple stem pitting virus, apple stem grooving virus, apple chlorotic leaf spot virus, apple hammerhead viroid and apple scar skin viroid were reconstructed. These viruses were further confirmed through Sanger sequencing in different apple cultivars. Among them, CCGaV, ARWV-1 and ARWV-2 were recorded from apples in India for the first time. The analysis of virus richness revealed that ApNMV was dominant, followed by ARWV-1 and CCGaV. Moreover, MEGAHIT identified novel single-nucleotide variants. CONCLUSIONS Our analyses highlight the crucial role of assembly methods in reconstructing near-complete apple virus genomes from Illumina reads. This study emphasizes the significance of employing multiple assemblers for de novo virus genome assembly in vegetatively propagated perennial fruit crops.
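Three of the assembly quality metrics named in this abstract (largest contig, total assembly length, and N50) can be computed from contig lengths alone. The following is a minimal sketch of the standard definitions, not the study's own evaluation code; the function names `n50` and `assembly_summary` are hypothetical. Genome coverage is left out because it requires aligning contigs to a reference genome.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


def assembly_summary(contig_lengths):
    """Collect the reference-free metrics for one assembly (hypothetical helper)."""
    return {
        "largest_contig": max(contig_lengths),
        "total_length": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }
```

Running `assembly_summary` on the contig lengths produced by each assembler for the same sample gives a like-for-like comparison of contiguity, although none of these numbers says anything about correctness.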
Affiliation(s)
- Zainul A Khan
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Current Address: United States Department of Agriculture, Agricultural Research Service, Northern Crop Science Laboratory, Fargo, ND, 58102, USA
- Susheel Kumar Sharma
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India.
- Nitika Gupta
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Damini Diksha
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Pooja Thapa
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Mailem Yazing Shimray
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Malyaj R Prajapati
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India
- Sajad U Nabi
- ICAR-Central Institute of Temperate Horticulture, Srinagar, 191132, India
- Santosh Watpade
- ICAR-Indian Agricultural Research Institute, Regional Station, Shimla, Himachal Pradesh, 171004, India
- Mahendra K Verma
- ICAR-Central Institute of Temperate Horticulture, Srinagar, 191132, India
- Virendra K Baranwal
- Advanced Centre for Plant Virology, Division of Plant Pathology, ICAR-Indian Agricultural Research Institute, New Delhi, 110012, India.
4
Chen Q, Yang C, Zhang G, Wu D. GCI: a continuity inspector for complete genome assembly. Bioinformatics 2024; 40:btae633. [PMID: 39432569 PMCID: PMC11550331 DOI: 10.1093/bioinformatics/btae633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Revised: 10/08/2024] [Accepted: 10/18/2024] [Indexed: 10/23/2024] Open
Abstract
MOTIVATION Recent advances in long-read sequencing technologies have significantly facilitated the production of high-quality genome assemblies. The telomere-to-telomere (T2T) gapless assembly has become the new gold standard of genome assembly efforts. Several recent efforts have claimed to produce T2T-level reference genomes. However, a universal standard for qualifying a genome assembly as T2T is still missing. Traditional genome assembly assessment metrics (N50 and its derivatives) cannot distinguish a nearly T2T assembly from a truly T2T assembly in terms of continuity, either globally or locally. Additionally, these metrics are independent of raw reads, which makes them easy to inflate through artificial operations. Therefore, a gaplessness evaluation tool at single-nucleotide resolution to reflect true completeness is urgently needed in the era of complete genomes. RESULTS Here, we present a tool called Genome Continuity Inspector (GCI), designed to assess genome assembly continuity at single-base resolution, and evaluate how close an assembly is to the T2T level. GCI utilizes multiple aligners to map long reads from various sequencing platforms back to the assembly. By incorporating curated mapping coverage of high-confidence read alignments, GCI identifies potential assembly issues. It also provides GCI scores that quantify overall assembly continuity at the whole-genome or chromosome scale. AVAILABILITY AND IMPLEMENTATION The open-source GCI code is freely available on GitHub (https://github.com/yeeus/GCI) under the MIT license.
Affiliation(s)
- Quanyu Chen
- International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
- Chu Kochen Honors College, Zhejiang University, Hangzhou 310058, China
- Chentao Yang
- BGI Research, Shenzhen 518083, China
- BGI Research, Wuhan 430074, China
- Guojie Zhang
- International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
- Women’s Hospital, School of Medicine, Zhejiang University, Hangzhou 310006, China
- Dongya Wu
- Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
5
Henglin M, Ghareghani M, Harvey WT, Porubsky D, Koren S, Eichler EE, Ebert P, Marschall T. Graphasing: phasing diploid genome assembly graphs with single-cell strand sequencing. Genome Biol 2024; 25:265. [PMID: 39390579 PMCID: PMC11466045 DOI: 10.1186/s13059-024-03409-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 09/30/2024] [Indexed: 10/12/2024] Open
Abstract
Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.
Affiliation(s)
- Mir Henglin
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Maryam Ghareghani
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
6
Zhao H, Zhou H, Sun G, Dong B, Zhu W, Mu X, Li X, Wang J, Zhao M, Yang W, Zhang G, Ji R, Geng T, Gong D, Meng H, Wang J. Telomere-to-telomere genome assembly of the goose Anser cygnoides. Sci Data 2024; 11:741. [PMID: 38972874 PMCID: PMC11228014 DOI: 10.1038/s41597-024-03567-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Accepted: 06/24/2024] [Indexed: 07/09/2024] Open
Abstract
Our study presents the assembly of a high-quality Taihu goose genome at the telomere-to-telomere (T2T) level, built from Pacific Biosciences HiFi reads, Oxford Nanopore long reads, Illumina short reads, and chromatin conformation capture (Hi-C) data. The T2T assembly has a total length of 1,197,991,206 bp, with a contig N50 of 33,928,929 bp and a scaffold N50 of 81,007,908 bp. It consists of 73 scaffolds, including 38 autosomes and one pair of Z/W sex chromosomes. Importantly, 33 autosomes were assembled without any gap, resulting in a contiguous representation. Furthermore, gene annotation efforts identified 34,898 genes, including 436,162 RNA transcripts, encompassing 806,158 exons, 743,910 introns, 651,148 coding sequences (CDS), and 135,622 untranslated regions (UTR). The T2T-level chromosome-scale goose genome assembly provides a vital foundation for future genetic improvement and understanding the genetic mechanisms underlying important traits in geese.
Affiliation(s)
- Hongchang Zhao
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Hao Zhou
- Key Laboratory of Veterinary Biotechnology, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 201100, China
- Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, 266071, China
- Guobo Sun
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Biao Dong
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- Wenqi Zhu
- Key Laboratory of Veterinary Biotechnology, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 201100, China
- Xiaohui Mu
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Xiaoming Li
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Jun Wang
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Mengli Zhao
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- National Waterfowl of gene pool, Taizhou, 225511, China
- Wenhao Yang
- Key Laboratory of Veterinary Biotechnology, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 201100, China
- Gansheng Zhang
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China
- Taizhou Fengda Agriculture and Animal Husbandry Technology Co., Ltd, Taizhou, 225511, China
- Rongchao Ji
- National Waterfowl of gene pool, Taizhou, 225511, China
- Taizhou Fengda Agriculture and Animal Husbandry Technology Co., Ltd, Taizhou, 225511, China
- Tuoyu Geng
- College of Animal Science and Technology, Yangzhou University, Yangzhou, 225000, China
- Daoqing Gong
- College of Animal Science and Technology, Yangzhou University, Yangzhou, 225000, China.
- He Meng
- Key Laboratory of Veterinary Biotechnology, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, 201100, China.
- Jian Wang
- Jiangsu Agri-animal Husbandry Vocational College, Taizhou, 225300, China.
- National Waterfowl of gene pool, Taizhou, 225511, China.
7
Shaw J, Gounot JS, Chen H, Nagarajan N, Yu YW. Floria: fast and accurate strain haplotyping in metagenomes. Bioinformatics 2024; 40:i30-i38. [PMID: 38940183 PMCID: PMC11211831 DOI: 10.1093/bioinformatics/btae252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remain a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short- and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry) while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. AVAILABILITY AND IMPLEMENTATION Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.
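The minimum error correction (MEC) objective mentioned in this abstract can be illustrated by scoring a fixed read clustering. The sketch below is a textbook-style illustration and is not Floria's implementation: the read representation (a dict mapping variant site to observed allele) and the function name `mec_score` are assumptions made for the example, and the network flow step is not shown. A lower score means the reads in each cluster agree better with that cluster's consensus haplotype.

```python
from collections import Counter


def mec_score(clusters):
    """Score a read clustering under the MEC objective.

    clusters: list of clusters; each cluster is a list of reads, and each
    read is a dict {variant_site: allele} (assumed representation).
    Returns the number of allele calls that disagree with the majority
    allele of their cluster at that site.
    """
    total_errors = 0
    for reads in clusters:
        site_counts = {}
        for read in reads:
            for site, allele in read.items():
                site_counts.setdefault(site, Counter())[allele] += 1
        for counts in site_counts.values():
            # Every call other than the majority allele would need
            # to be "corrected" to match the consensus haplotype.
            total_errors += sum(counts.values()) - max(counts.values())
    return total_errors


# Hypothetical toy data: five reads over two variant sites.
strain_a = [{0: 0, 1: 0}, {0: 0, 1: 1}, {1: 0}]
strain_b = [{0: 1, 1: 1}, {0: 1}]
```

With this toy data, separating the reads into two clusters leaves a single disagreeing call, whereas pooling all reads into one cluster leaves four, which is the signal a MEC-based clustering exploits.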
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
- Jean-Sebastien Gounot
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Hanrong Chen
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Niranjan Nagarajan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Republic of Singapore
- Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, United States
8
Henglin M, Ghareghani M, Harvey W, Porubsky D, Koren S, Eichler EE, Ebert P, Marschall T. Phasing Diploid Genome Assembly Graphs with Single-Cell Strand Sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.15.580432. [PMID: 38529499 PMCID: PMC10962706 DOI: 10.1101/2024.02.15.580432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de-novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de-novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio-phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.
Affiliation(s)
- Mir Henglin
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
- Maryam Ghareghani
- Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- William Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Evan E. Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
9
Espinosa E, Bautista R, Larrosa R, Plata O. Advancements in long-read genome sequencing technologies and algorithms. Genomics 2024; 116:110842. [PMID: 38608738 DOI: 10.1016/j.ygeno.2024.110842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 04/01/2024] [Accepted: 04/06/2024] [Indexed: 04/14/2024]
Abstract
The recent advent of long-read sequencing technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), has led to substantial improvements in accuracy and computational cost in sequencing genomes. However, de novo whole-genome assembly remains a formidable challenge, both in its computational demands and in the quality of the results. As sequencing accuracy and throughput steadily advance, a continuous stream of innovative assembly tools floods the field. Navigating this dynamic landscape necessitates a reasonable choice of sequencing platform, depth, and assembly tools to orchestrate high-quality genome reconstructions. This comprehensive review delves into the intricate interplay between cutting-edge long-read sequencing technologies, assembly methodologies, and the ever-evolving field of genomics. With a focus on addressing the pivotal challenges and harnessing the opportunities presented by these advancements, we provide an in-depth exploration of the crucial factors influencing the selection of optimal strategies for achieving robust and insightful genome assemblies.
Affiliation(s)
- Elena Espinosa
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain.
- Rocio Bautista
- Supercomputing and Bioinnovation Center, University of Malaga, C. Severo Ochoa, 34, Malaga 29590, Spain.
- Rafael Larrosa
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain; Supercomputing and Bioinnovation Center, University of Malaga, C. Severo Ochoa, 34, Malaga 29590, Spain.
- Oscar Plata
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain.
10
Bossert S, Pauly A, Danforth BN, Orr MC, Murray EA. Lessons from assembling UCEs: A comparison of common methods and the case of Clavinomia (Halictidae). Mol Ecol Resour 2024; 24:e13925. [PMID: 38183389 DOI: 10.1111/1755-0998.13925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 12/08/2023] [Accepted: 12/21/2023] [Indexed: 01/08/2024]
Abstract
Sequence data assembly is a foundational step in high-throughput sequencing, with untold consequences for downstream analyses. Despite this, few studies have interrogated the many methods for assembling phylogenomic UCE data for their comparative efficacy, or for how outputs may be impacted. We study this by comparing the most commonly used assembly methods for UCEs in the under-studied bee lineage Nomiinae and a representative sampling of relatives. Data for 63 UCE-only and 75 mixed taxa were assembled with five methods, including ABySS, HybPiper, SPAdes, Trinity and Velvet, and then benchmarked for their relative performance in terms of locus capture parameters and phylogenetic reconstruction. Unexpectedly, Trinity and Velvet trailed the other methods in terms of locus capture and DNA matrix density, whereas SPAdes performed favourably in most assessed metrics. In comparison with SPAdes, the guided-assembly approach HybPiper generally recovered the highest quality loci but in lower numbers. Based on our results, we formally move Clavinomia to Dieunomiini and render Epinomia once more a subgenus of Dieunomia. We strongly advise that future studies more closely examine the influence of assembly approach on their results, or, minimally, use better-performing assembly methods such as SPAdes or HybPiper. In this way, we can move forward with phylogenomic studies in a more standardized, comparable manner.
Affiliation(s)
- Silas Bossert
- Department of Entomology, Washington State University, Pullman, Washington, USA
- Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Alain Pauly
- Royal Belgian Institute of Natural Sciences, O.D. Taxonomy and Phylogeny, Brussels, Belgium
- Bryan N Danforth
- Department of Entomology, Cornell University, Ithaca, New York, USA
- Michael C Orr
- Entomologie, Staatliches Museum für Naturkunde Stuttgart, Stuttgart, Germany
- Elizabeth A Murray
- Department of Entomology, Washington State University, Pullman, Washington, USA
11
Do V, Nguyen S, Le D, Nguyen T, Nguyen C, Ho T, Vo N, Nguyen T, Nguyen H, Cao M. Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies. Nucleic Acids Res 2024; 52:e15. [PMID: 38084888 PMCID: PMC10853769 DOI: 10.1093/nar/gkad1170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 11/07/2023] [Accepted: 11/22/2023] [Indexed: 02/10/2024] Open
Abstract
Whole genome sequencing has increasingly become the essential method for studying the genetic mechanisms of antimicrobial resistance and for surveillance of drug-resistant bacterial pathogens. The majority of bacterial genomes sequenced to date have been sequenced with Illumina sequencing technology, owing to its high throughput, excellent sequence accuracy, and low cost. However, because of the short-read nature of the technology, these assemblies are fragmented into large numbers of contigs, which hinders recovery of the full information in the genome. We develop Pasa, a graph-based algorithm that utilizes the pangenome graph and the assembly graph information to improve scaffolding quality. By leveraging the population information of the bacterial species, Pasa is able to utilize the linkage information of the gene families of the species to resolve the contig graph of the assembly. We show that our method outperforms the current state of the art in terms of accuracy and, at the same time, is computationally efficient enough to be applied to a large number of existing draft assemblies.
Affiliation(s)
- Van Hoan Do
- Center for Applied Mathematics and Informatics, Le Quy Don Technical University, Hanoi, Vietnam
| | | | - Duc Quang Le
- Faculty of IT, Hanoi University of Civil Engineering, Hanoi, Vietnam
| | - Tam Thi Nguyen
- Oxford University Clinical Research Unit, Hanoi, Vietnam
| | - Canh Hao Nguyen
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
| | - Tho Huu Ho
- Department of Medical Microbiology, The 103 Military Hospital, Vietnam Military Medical University, Hanoi, Vietnam
- Department of Genomics & Cytogenetics, Institute of Biomedicine & Pharmacy, Vietnam Military Medical University, Hanoi, Vietnam
| | - Nam S Vo
- Center for Biomedical Informatics, Vingroup Big Data Institute, Hanoi, Vietnam
|
12
|
Sami A, El-Metwally S, Rashad MZ. MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. BMC Bioinformatics 2024; 25:61. [PMID: 38321434 PMCID: PMC10848413 DOI: 10.1186/s12859-024-05681-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 01/29/2024] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filter errors, and data quality assurance is moved from the hardware space to the software preprocessing stages. RESULTS We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF_IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.
CONCLUSIONS This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
Affiliation(s)
- Amira Sami
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
| | - Sara El-Metwally
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.
- Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt.
| | - M Z Rashad
- Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
|
13
|
Safar HA, Alatar F, Mustafa AS. Three Rounds of Read Correction Significantly Improve Eukaryotic Protein Detection in ONT Reads. Microorganisms 2024; 12:247. [PMID: 38399651 PMCID: PMC10893331 DOI: 10.3390/microorganisms12020247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Revised: 01/19/2024] [Accepted: 01/23/2024] [Indexed: 02/25/2024] Open
Abstract
BACKGROUND Eukaryotes' whole-genome sequencing is crucial for species identification, gene detection, and protein annotation. Oxford Nanopore Technology (ONT) is an affordable and rapid platform for sequencing eukaryotes; however, the relatively higher error rates require computational and bioinformatic efforts to produce more accurate genome assemblies. Here, we evaluated the effect of read correction tools on eukaryote genome completeness, gene detection and protein annotation. METHODS Reads generated by ONT of four eukaryotes, C. albicans, C. gattii, S. cerevisiae, and P. falciparum, were assembled using minimap2 and underwent three rounds of read correction using flye, medaka and racon. The generated consensus FASTA files were compared for total length (bp), genome completeness, gene detection, and protein annotation by QUAST, BUSCO, BRAKER1 and InterProScan, respectively. RESULTS Genome completeness was dependent on the assembly method rather than on the read correction tool; however, medaka performed better than flye and racon. Racon performed significantly better than flye and medaka in gene detection, while both racon and medaka performed significantly better than flye in protein annotation. CONCLUSION We show that three rounds of read correction significantly affect gene detection and protein annotation, which are dependent on assembly quality in preference to assembly completeness.
Affiliation(s)
- Hussain A. Safar
- OMICS Research Unit, Health Science Centre, Kuwait University, Kuwait City 13110, Kuwait;
| | - Fatemah Alatar
- Serology and Molecular Microbiology Reference Laboratory, Mubarak Al-Kabeer Hospital, Ministry of Health, Kuwait City 13110, Kuwait;
| | - Abu Salim Mustafa
- Department of Microbiology, Faculty of Medicine, Kuwait University, Kuwait City 13110, Kuwait
|
14
|
Rádai Z, Váradi A, Takács P, Nagy NA, Schmitt N, Prépost E, Kardos G, Laczkó L. An overlooked phenomenon: complex interactions of potential error sources on the quality of bacterial de novo genome assemblies. BMC Genomics 2024; 25:45. [PMID: 38195441 PMCID: PMC10777565 DOI: 10.1186/s12864-023-09910-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 12/15/2023] [Indexed: 01/11/2024] Open
Abstract
BACKGROUND Parameters adversely affecting the contiguity and accuracy of the assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking their potential interactions possibly exacerbating one another's effects in a multiplicative manner. To investigate whether or not they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, PCR and optical duplicate ratios. RESULTS We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality. CONCLUSIONS We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced also should be taken into account, as they might influence the effects of error sources themselves.
Affiliation(s)
- Zoltán Rádai
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary.
- Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany.
| | - Alex Váradi
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- Department of Laboratory Medicine, Medical School, University of Pécs, Pécs, Hungary
| | - Péter Takács
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- Department of Health Informatics, Institute of Health Sciences, Faculty of Health, University of Debrecen, Debrecen, Hungary
| | - Nikoletta Andrea Nagy
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- Department of Evolutionary Zoology, ELKH-DE Behavioural Ecology Research Group, University of Debrecen, Debrecen, Hungary
- Department of Evolutionary Zoology and Human Biology, University of Debrecen, Debrecen, Hungary
| | - Nicholas Schmitt
- Department of Dermatology, University Hospital Düsseldorf, Heinrich-Heine-University, Düsseldorf, Germany
| | - Eszter Prépost
- Department of Health Industry, University of Debrecen, Debrecen, Hungary
| | - Gábor Kardos
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- Department of Gerontology, Faculty of Health Sciences, University of Debrecen, Debrecen, Hungary
| | - Levente Laczkó
- Institute of Metagenomics, University of Debrecen, Debrecen, Hungary
- ELKH-DE Conservation Biology Research Group, Debrecen, Hungary
|
15
|
Thalén F, Köhne CG, Bleidorn C. Patchwork: Alignment-Based Retrieval and Concatenation of Phylogenetic Markers from Genomic Data. Genome Biol Evol 2023; 15:evad227. [PMID: 38085033 PMCID: PMC10735302 DOI: 10.1093/gbe/evad227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/06/2023] [Indexed: 12/23/2023] Open
Abstract
Low-coverage whole-genome sequencing (also known as "genome skimming") is becoming an increasingly affordable approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome skimming is rather rarely utilized for recovering single-copy nuclear markers. One reason might be that only few tools exist to work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. We here present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence similarity search, followed by a "hit stitching" phase, in which adjacent or overlapping regions are merged into a single unit. The novel sliding window algorithm trims away any noncoding regions from the resulting sequence. We demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study, and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork is released under the GNU General Public License version 3. Installation instructions, additional documentation, and the source code itself are all available via GitHub at https://github.com/fethalen/Patchwork.
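The "hit stitching" phase described in this abstract, in which adjacent or overlapping regions are merged into a single unit, can be illustrated with a generic interval merge. The sketch below is not Patchwork's code (Patchwork is written in Julia); the function name `stitch_hits`, the end-exclusive `(start, end)` hit representation, and the `max_gap` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a hit is a (start, end) pair, end-exclusive.
Hit = Tuple[int, int]


def stitch_hits(hits: List[Hit], max_gap: int = 0) -> List[Hit]:
    """Merge hits that overlap or lie within max_gap positions of each other."""
    stitched: List[Hit] = []
    for start, end in sorted(hits):
        if stitched and start <= stitched[-1][1] + max_gap:
            # Overlapping or adjacent: extend the previous unit.
            prev_start, prev_end = stitched[-1]
            stitched[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far from the previous unit: start a new one.
            stitched.append((start, end))
    return stitched


if __name__ == "__main__":
    print(stitch_hits([(10, 20), (0, 5), (5, 8), (18, 30)]))
```

Sorting first means each hit only has to be compared with the most recently stitched unit, so the merge is a single pass over the hits.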
Affiliation(s)
- Felix Thalén
- Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
- Cardio-CARE AG, Medizincampus Davos, Davos Wolfgang 7265, Switzerland
| | - Clara G Köhne
- Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
| | - Christoph Bleidorn
- Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
|
16
|
Li K, Xu P, Wang J, Yi X, Jiao Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat Commun 2023; 14:6556. [PMID: 37848433 PMCID: PMC10582259 DOI: 10.1038/s41467-023-42336-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 10/05/2023] [Indexed: 10/19/2023] Open
Abstract
Assembly of a high-quality genome is important for downstream comparative and functional genomic studies. However, most tools for genome assembly assessment only give qualitative reports, which do not pinpoint assembly errors at specific regions. Here, we develop a new reference-free tool, Clipping information for Revealing Assembly Quality (CRAQ), which maps raw reads back to assembled sequences to identify regional and structural assembly errors based on effective clipped alignment information. Error counts are transformed into corresponding assembly evaluation indexes to reflect the assembly quality at single-nucleotide resolution. Notably, CRAQ distinguishes assembly errors from heterozygous sites or structural differences between haplotypes. This tool can clearly indicate low-quality regions and potential structural error breakpoints; thus, it can identify misjoined regions that should be split for further scaffold building and improvement of the assembly. We have benchmarked CRAQ on multiple genomes assembled using different strategies, and demonstrated the misjoin correction for improving the constructed pseudomolecules.
Affiliation(s)
- Kunpeng Li
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Peng Xu
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jinpeng Wang
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xin Yi
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China
- China National Botanical Garden, Beijing, China
| | - Yuannian Jiao
- State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, the Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
- China National Botanical Garden, Beijing, China.
|
17
|
Narh Mensah DL, Wingfield BD, Coetzee MP. A practical approach to genome assembly and annotation of Basidiomycota using the example of Armillaria. Biotechniques 2023; 75:115-128. [PMID: 37681497 DOI: 10.2144/btn-2023-0023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023] Open
Abstract
Technological advancements in genome sequencing, assembly and annotation platforms and algorithms that resulted in several genomic studies have created an opportunity to further our understanding of the biology of phytopathogens, including Armillaria species. Most Armillaria species are facultative necrotrophs that cause root- and stem-rot, usually on woody plants, significantly impacting agriculture and forestry worldwide. Genome sequencing, assembly and annotation in terms of samples used and methods applied in Armillaria genome projects are evaluated in this review. Infographic guidelines and a database of resources to facilitate future Armillaria genome projects were developed. Knowledge gained from genomic studies of Armillaria species is summarized and prospects for further research are provided. This guide can be applied to other diploid and dikaryotic fungal genomics.
Affiliation(s)
- Deborah L Narh Mensah
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
- Council for Scientific and Industrial Research - Food Research Institute (CSIR-FRI), PO Box M20, Accra, Ghana
| | - Brenda D Wingfield
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
| | - Martin Pa Coetzee
- Department of Biochemistry, Genetics & Microbiology, Forestry & Agricultural Biotechnology Institute (FABI), Faculty of Natural & Agricultural Sciences, University of Pretoria, Pretoria, Gauteng, South Africa
|
18
|
Ritsch M, Cassman NA, Saghaei S, Marz M. Navigating the Landscape: A Comprehensive Review of Current Virus Databases. Viruses 2023; 15:1834. [PMID: 37766241 PMCID: PMC10537806 DOI: 10.3390/v15091834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Revised: 08/18/2023] [Accepted: 08/21/2023] [Indexed: 09/29/2023] Open
Abstract
Viruses are abundant and diverse entities that have important roles in public health, ecology, and agriculture. The identification and surveillance of viruses rely on an understanding of their genome organization, sequences, and replication strategy. Despite technological advancements in sequencing methods, our current understanding of virus diversity remains incomplete, highlighting the need to explore undiscovered viruses. Virus databases play a crucial role in providing access to sequences, annotations and other metadata, and analysis tools for studying viruses. However, there has not been a comprehensive review of virus databases in the last five years. This study aimed to fill this gap by identifying 24 active virus databases and included an extensive evaluation of their content, functionality and compliance with the FAIR principles. In this study, we thoroughly assessed the search capabilities of five database catalogs, which serve as comprehensive repositories housing a diverse array of databases and offering essential metadata. Moreover, we conducted a comprehensive review of different types of errors, encompassing taxonomy, names, missing information, sequences, sequence orientation, and chimeric sequences, with the intention of empowering users to effectively tackle these challenges. We expect this review to aid users in selecting suitable virus databases and other resources, and to help databases manage errors and improve their adherence to the FAIR principles. The databases listed here represent the current knowledge of viruses and will help users find databases of interest based on content, functionality, and scope. The use of virus databases is integral to gaining new insights into the biology, evolution, and transmission of viruses, and developing new strategies to manage virus outbreaks and preserve global health.
Affiliation(s)
- Muriel Ritsch
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany;
- European Virus Bioinformatics Center, 07743 Jena, Germany
| | - Noriko A. Cassman
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany;
- European Virus Bioinformatics Center, 07743 Jena, Germany
| | - Shahram Saghaei
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany;
- European Virus Bioinformatics Center, 07743 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany;
- European Virus Bioinformatics Center, 07743 Jena, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
- FLI Leibniz Institute for Age Research, 07745 Jena, Germany
|
19
|
Safar HA, Alatar F, Nasser K, Al-Ajmi R, Alfouzan W, Mustafa AS. The impact of applying various de novo assembly and correction tools on the identification of genome characterization, drug resistance, and virulence factors of clinical isolates using ONT sequencing. BMC Biotechnol 2023; 23:26. [PMID: 37525145 PMCID: PMC10391896 DOI: 10.1186/s12896-023-00797-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/02/2023] [Accepted: 07/21/2023] [Indexed: 08/02/2023] Open
Abstract
Oxford Nanopore sequencing technology (ONT) is currently widely used due to its affordability, simplicity, and reliability. Despite the advantage ONT has over next-generation sequencing in detecting resistance genes in mobile genetic elements, its relatively high error rate (10-15%) is still a deterrent. Several bioinformatic tools are freely available for raw data processing and obtaining complete and more accurate genome assemblies. In this study, we evaluated the impact of mixing and matching read assembly tools (Flye, Canu, Wtdbg2, and NECAT) and read correction tools (Medaka, NextPolish, and Racon) on generating complete and accurate genome assemblies, and on downstream genomic analysis of nine clinical Escherichia coli isolates. The Flye and Canu assemblers were the most robust in genome assembly, and the Medaka and Racon correction tools significantly improved assembly parameters. Flye functioned well in pan-genome analysis, while Medaka increased the number of core genes detected. The Flye, Canu, and NECAT assemblers functioned well in detecting antimicrobial resistance (AMR) genes, while Wtdbg2 required correction tools for better detection. Flye was the best assembler for detecting and locating both virulence and AMR genes (i.e., chromosomal vs. plasmid). This study provides insight into the performance of several read assembly and read correction tools for analyzing ONT sequencing reads for clinical isolates.
Affiliation(s)
- Hussain A Safar
- OMICS Research Unit, Health Science Centre, Kuwait University, Hawalli Governorate, Kuwait
| | - Fatemah Alatar
- Serology and Molecular Microbiology Reference Laboratory, Mubarak Al-Kabeer Hospital, Ministry of Health, Hawalli Governorate, Kuwait
| | - Kother Nasser
- Serology and Molecular Microbiology Reference Laboratory, Mubarak Al-Kabeer Hospital, Ministry of Health, Hawalli Governorate, Kuwait
| | - Rehab Al-Ajmi
- Department of Microbiology, Faculty of Medicine, Kuwait University, Hawalli Governorate, Kuwait
| | - Wadha Alfouzan
- Department of Microbiology, Faculty of Medicine, Kuwait University, Hawalli Governorate, Kuwait
- Microbiology Unit, Farwaniya Hospital, Ministry of Health, Al Farwaniyah Governorate, Kuwait
| | - Abu Salim Mustafa
- Department of Microbiology, Faculty of Medicine, Kuwait University, Hawalli Governorate, Kuwait.
|
20
|
Modahl CM, Chowdhury A, Low DHW, Manuel MC, Missé D, Kini RM, Mendenhall IH, Pompon J. Midgut transcriptomic responses to dengue and chikungunya viruses in the vectors Aedes albopictus and Aedes malayensis. Sci Rep 2023; 13:11271. [PMID: 37438463 DOI: 10.1038/s41598-023-38354-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/15/2022] [Accepted: 07/06/2023] [Indexed: 07/14/2023] Open
Abstract
Dengue (DENV) and chikungunya (CHIKV) viruses are among the most preponderant arboviruses. Although primarily transmitted through the bite of Aedes aegypti mosquitoes, Aedes albopictus and Aedes malayensis are competent vectors and have an impact on arbovirus epidemiology. Here, to fill the gap in our understanding of the molecular interactions between secondary vectors and arboviruses, we used transcriptomics to profile the whole-genome responses of A. albopictus to CHIKV and of A. malayensis to CHIKV and DENV at 1 and 4 days post-infection (dpi) in midguts. In A. albopictus, 1793 and 339 genes were significantly regulated by CHIKV at 1 and 4 dpi, respectively. In A. malayensis, 943 and 222 genes upon CHIKV infection, and 74 and 69 genes upon DENV infection were significantly regulated at 1 and 4 dpi, respectively. We reported 81 genes that were consistently differentially regulated in all the CHIKV-infected conditions, identifying a CHIKV-induced signature. We identified expressed immune genes in both mosquito species, using a de novo assembled midgut transcriptome for A. malayensis, and described the immune architectures. We found the JNK pathway activated in all conditions, generalizing its antiviral function to Aedines. Our comprehensive study provides insight into arbovirus transmission by multiple Aedes vectors.
Affiliation(s)
- Cassandra M Modahl
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Liverpool School of Tropical Medicine, Liverpool, U.K
| | - Avisha Chowdhury
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Toronto Centre for Liver Disease, Toronto General Hospital, University Health Network, University of Toronto, Toronto, Canada
| | - Dolyce H W Low
- Programme in Emerging Infectious Diseases, Duke-NUS Medical School, Singapore, Singapore
| | - Menchie C Manuel
- Programme in Emerging Infectious Diseases, Duke-NUS Medical School, Singapore, Singapore
| | - Dorothée Missé
- MIVEGEC, Univ. Montpellier, IRD, CNRS, Montpellier, France
| | - R Manjunatha Kini
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Department of Pharmacology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Ian H Mendenhall
- Programme in Emerging Infectious Diseases, Duke-NUS Medical School, Singapore, Singapore
| | - Julien Pompon
- Programme in Emerging Infectious Diseases, Duke-NUS Medical School, Singapore, Singapore.
- MIVEGEC, Univ. Montpellier, IRD, CNRS, Montpellier, France.
|
21
|
Rahman Hera M, Pierce-Ward NT, Koslicki D. Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Res 2023; 33:1061-1068. [PMID: 37344105 PMCID: PMC10538494 DOI: 10.1101/gr.277651.123] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/04/2023] [Accepted: 06/06/2023] [Indexed: 06/23/2023]
Abstract
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, in order to effectively warn users of FracMinHash about the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely compared with traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances.
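As a rough illustration of the idea in this abstract, the sketch below builds a FracMinHash-style sketch by keeping every k-mer whose hash falls below a fixed fraction of the hash space, then estimates containment from two sketches. It is a minimal sketch under stated assumptions, not the paper's or sourmash's implementation: the hash function (BLAKE2b truncated to 64 bits), the function names, and the omission of canonical k-mers are all choices made here. The bias correction derived in the paper is not applied, and the point estimate uses the commonly assumed simple model in which each base mutates independently, so that containment is approximately (1 - p)^k.

```python
import hashlib
from typing import Set

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used in this sketch


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of seq (no reverse complements, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (an arbitrary choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(seq: str, k: int, scale: float) -> Set[int]:
    """Keep the hashes that fall in the lowest `scale` fraction of hash space."""
    threshold = scale * MAX_HASH
    return {h for h in map(hash64, kmers(seq, k)) if h < threshold}


def containment(sketch_a: Set[int], sketch_b: Set[int]) -> float:
    """Uncorrected estimate of the containment of A in B from their sketches."""
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)


def mutation_rate_point_estimate(c: float, k: int) -> float:
    """Solve c = (1 - p)^k for p under the simple independent-mutation model."""
    return 1.0 - c ** (1.0 / k)
```

Because the same threshold is applied to every data set, the sketch of a small genome and the sketch of a large metagenome remain directly comparable, and sketch size grows with the number of distinct k-mers rather than being fixed.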
Affiliation(s)
- Mahmudur Rahman Hera
- Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA
| | - N Tessa Pierce-Ward
- Department of Population Health and Reproduction, University of California, Davis, California 95616, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, State College, Pennsylvania 16801, USA;
- Department of Biology, The Pennsylvania State University, State College, Pennsylvania 16801, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, Pennsylvania 16801, USA
|
22
|
Cao S, Li M, Li LM. RegCloser: a robust regression approach to closing genome gaps. BMC Bioinformatics 2023; 24:249. [PMID: 37312038 DOI: 10.1186/s12859-023-05367-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 05/27/2023] [Indexed: 06/15/2023] Open
Abstract
BACKGROUND Closing gaps in draft genomes leads to more complete and continuous genome assemblies. The ubiquitous genomic repeats are challenges to the existing gap-closing methods, based on either the k-mer representation by the de Bruijn graph or the overlap-layout-consensus paradigm. Besides, chimeric reads will cause erroneous k-mers in the former and false overlaps of reads in the latter. RESULTS We propose a novel local assembly approach to gap closing, called RegCloser. It represents read coordinates and their overlaps respectively by parameters and observations in a linear regression model. The optimal overlap is searched only in the restricted range consistent with insert sizes. Under this linear regression framework, the local DNA assembly becomes a robust parameter estimation problem. We solved the problem by a customized robust regression procedure that resists the influence of false overlaps by optimizing a convex global Huber loss function. The global optimum is obtained by iteratively solving the sparse system of linear equations. On both simulated and real datasets, RegCloser outperformed other popular methods in accurately resolving the copy number of tandem repeats, and achieved superior completeness and contiguity. Applying RegCloser to a plateau zokor draft genome that had been improved by long reads further increased contig N50 3-fold. We also tested the robust regression approach on layout generation of long reads. CONCLUSIONS RegCloser is a competitive gap-closing tool. The software is available at https://github.com/csh3/RegCloser. The robust regression approach has the prospect of being incorporated into the layout module of long read assemblers.
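The regression view in this abstract, with read coordinates as parameters and overlaps as observations, can be sketched on a toy example. The code below is an illustration under stated assumptions, not RegCloser's implementation: each overlap is a hypothetical triple `(i, j, d)` observing that read `j` starts about `d` bases after read `i`, read 0 is pinned at coordinate 0, and the Huber loss is minimized by iteratively reweighted Gauss-Seidel sweeps rather than by the sparse linear solves the abstract mentions.

```python
from typing import List, Optional, Tuple

# Hypothetical observation: (i, j, d) means x[j] - x[i] is approximately d.
Overlap = Tuple[int, int, float]


def huber_weight(residual: float, delta: float) -> float:
    """Reweighting factor for the Huber loss: 1 near zero, delta/|r| beyond delta."""
    size = abs(residual)
    return 1.0 if size <= delta else delta / size


def initial_layout(n_reads: int, overlaps: List[Overlap]) -> List[float]:
    """Greedy starting point: place each read from the first overlap that reaches it."""
    x: List[Optional[float]] = [None] * n_reads
    x[0] = 0.0
    for _ in range(n_reads):
        for a, b, d in overlaps:
            if x[a] is not None and x[b] is None:
                x[b] = x[a] + d
            elif x[b] is not None and x[a] is None:
                x[a] = x[b] - d
    return [0.0 if v is None else v for v in x]


def robust_layout(n_reads: int, overlaps: List[Overlap],
                  delta: float = 5.0, sweeps: int = 500) -> List[float]:
    """Estimate read coordinates by minimizing the summed Huber loss of residuals."""
    x = initial_layout(n_reads, overlaps)
    for _ in range(sweeps):
        for i in range(1, n_reads):  # read 0 stays pinned at 0
            num = den = 0.0
            for a, b, d in overlaps:
                if b == i:
                    target = x[a] + d  # where this overlap says read i starts
                elif a == i:
                    target = x[b] - d
                else:
                    continue
                w = huber_weight(x[i] - target, delta)
                num += w * target
                den += w
            if den > 0.0:
                x[i] = num / den  # weighted mean of the targets
    return x
```

In a toy layout with reads at 0, 100, 200 and 300 and one false overlap claiming that read 3 starts 50 bases after read 0, the Huber weights shrink the influence of the false observation, whereas setting `delta=float("inf")` reduces the procedure to ordinary least squares and lets the false overlap drag read 3 far from its true position.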
Affiliation(s)
- Shenghao Cao
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Mengtian Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
- Lei M Li
  - National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Beijing, 100049, China
|
23
|
Ferguson S, Jones A, Murray K, Schwessinger B, Borevitz JO. Interspecies genome divergence is predominantly due to frequent small scale rearrangements in Eucalyptus. Mol Ecol 2023; 32:1271-1287. [PMID: 35810343 DOI: 10.1111/mec.16608] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/17/2022] [Revised: 07/02/2022] [Accepted: 07/04/2022] [Indexed: 11/27/2022]
Abstract
Synteny, the ordering of sequences within homologous chromosomes, must be maintained within the genomes of sexually reproducing species for the sharing of alleles and production of viable, reproducing offspring. However, when the genomes of closely related species are compared, a loss of synteny is often observed. Unequal homologous recombination is the primary mechanism behind synteny loss, occurring more often in transposon rich regions, and resulting in the formation of chromosomal rearrangements. To examine patterns of synteny among three closely related, interbreeding, and wild Eucalyptus species, we assembled their genomes using long-read DNA sequencing and de novo assembly. We identify syntenic and rearranged regions between these genomes and estimate that ~48% of our genomes remain syntenic while ~36% is rearranged. We observed that rearrangements highly fragment microsynteny. Our results suggest that synteny between these species is primarily lost through small-scale rearrangements, not through sequence loss, gain, or sequence divergence. Further examination of identified rearrangements suggests that rearrangements may be altering the phenotypes of Eucalyptus species. Our study also underscores that the use of single reference genomes in genomic variation studies could lead to reference bias, especially given the scale at which we show potentially adaptive loci have highly diverged, deleted, duplicated and/or rearranged. This study provides an unbiased framework to look at potential speciation and adaptive loci among a rapidly radiating foundation species of woodland trees that are free from selective breeding seen in most crop species.
Affiliation(s)
- Scott Ferguson
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Ashley Jones
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Kevin Murray
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
  - Weigel Department, Max Planck Institute for Developmental Biology, Tuebingen, Germany
- Benjamin Schwessinger
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Justin O Borevitz
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
|
24
|
Wang P, Wang F. A proposed metric set for evaluation of genome assembly quality. Trends Genet 2023; 39:175-186. [PMID: 36402623 DOI: 10.1016/j.tig.2022.10.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/05/2022] [Revised: 10/24/2022] [Accepted: 10/26/2022] [Indexed: 11/18/2022]
Abstract
Quality control is essential for genome assemblies; however, a consensus has yet to be reached on what metrics should be adopted for the evaluation of assembly quality. N50 is widely used for contiguity measurement, but its effectiveness is constantly in question. Prevailing metrics for the completeness evaluation focus on gene space, yet challenging areas such as tandem repeats are commonly overlooked. Achieving correctness has become an indispensable dimension for quality control, while prevailing assembly releases lack scores reflecting this aspect. We propose a metric set with a set of statistic indexes for effective, comprehensive evaluation of assemblies and provide a score of a finished assembly for each metric, which can be utilized as a benchmark for achieving high-quality genome assemblies.
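For readers unfamiliar with the contiguity statistic questioned above, the sketch below computes N50 under its usual definition: the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly length. Passing an estimated genome size instead gives NG50. The function name `nx` and the example lengths are illustrative assumptions, not taken from the paper.

```python
def nx(lengths, x=50, genome_size=None):
    """Return the Nx (or NGx, if genome_size is given) of a set of contig lengths.

    Returns None when the contigs never reach x% of the reference total,
    which can happen for NGx on a very incomplete assembly.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


contigs = [80, 70, 50, 40, 30, 20]      # total assembly length = 290
n50 = nx(contigs)                       # half of 290 is reached at the 70 contig
ng50 = nx(contigs, genome_size=400)     # half of 400 is reached at the 50 contig
```

Because the statistic depends only on lengths, joining contigs incorrectly raises N50 just as well as joining them correctly, which is one reason the abstract treats correctness as a separate dimension.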
Affiliation(s)
- Peng Wang
  - Key Laboratory of Crop Gene Resources and Germplasm Enhancement in Southern China, Ministry of Agriculture and Rural Affairs, Institute of Tropical Crop Genetic Resources, Chinese Academy of Tropical Agricultural Sciences, No. 4 Xueyuan Rd, Haikou City, Hainan 571101, China
- Fei Wang
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, No. 100 Haiquan Rd, Shanghai 201416, China
|
25
|
Sahlin K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol 2022; 23:260. [PMID: 36522758 PMCID: PMC9753264 DOI: 10.1186/s13059-022-02831-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 05/24/2022] [Accepted: 12/02/2022] [Indexed: 12/23/2022] Open
Abstract
Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign.
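As a rough illustration of one of the two seeding ideas named above, the sketch below selects open syncmers: a k-mer is kept when its smallest s-mer starts at a fixed offset t inside it. This is a simplified sketch, not strobealign's code. It orders s-mers lexicographically (real tools typically use a hash ordering), it does not show the strobemer linking step, and the function name and the parameters `k`, `s`, and `t` are assumptions chosen for the example.

```python
def open_syncmers(seq, k, s, t=0):
    """Return (position, k-mer) pairs whose smallest s-mer starts at offset t."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # Ties are broken toward the leftmost occurrence of the minimum.
        if smers.index(min(smers)) == t:
            selected.append((i, kmer))
    return selected


seeds = open_syncmers("GATTACA", k=4, s=2)
```

Whether a k-mer is selected depends only on its own sequence, not on its neighbours, so the same k-mer is chosen in a read and in the reference.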
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
|
26
|
Bassi C, Guerriero P, Pierantoni M, Callegari E, Sabbioni S. Novel Virus Identification through Metagenomics: A Systematic Review. Life (Basel) 2022; 12:life12122048. [PMID: 36556413 PMCID: PMC9784588 DOI: 10.3390/life12122048] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/24/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/12/2022]
Abstract
Metagenomic Next Generation Sequencing (mNGS) allows the evaluation of complex microbial communities, avoiding isolation and cultivation of each microbial species, and does not require prior knowledge of the microbial sequences present in the sample. Applications of mNGS include virome characterization, new virus discovery and full-length viral genome reconstruction, either from virus preparations enriched in culture or directly from clinical and environmental specimens. Here, we systematically reviewed studies that describe novel virus identification through mNGS from samples of different origin (plant, animal and environment). Without imposing time limits to the search, 379 publications were identified that met the search parameters. Sample types, geographical origin, enrichment and nucleic acid extraction methods, sequencing platforms, bioinformatic analytical steps and identified viral families were described. The review highlights mNGS as a feasible method for novel virus discovery from samples of different origins, describes which kind of heterogeneous experimental and analytical protocols are currently used and provides useful information such as the different commercial kits used for the purification of nucleic acids and bioinformatics analytical pipelines.
Affiliation(s)
- Cristian Bassi
  - Department of Translational Medicine, University of Ferrara, 44121 Ferrara, Italy
  - Laboratorio per Le Tecnologie delle Terapie Avanzate (LTTA), University of Ferrara, 44121 Ferrara, Italy
- Paola Guerriero
  - Department of Translational Medicine, University of Ferrara, 44121 Ferrara, Italy
  - Laboratorio per Le Tecnologie delle Terapie Avanzate (LTTA), University of Ferrara, 44121 Ferrara, Italy
- Marina Pierantoni
  - Department of Translational Medicine, University of Ferrara, 44121 Ferrara, Italy
- Elisa Callegari
  - Department of Translational Medicine, University of Ferrara, 44121 Ferrara, Italy
- Silvia Sabbioni
  - Laboratorio per Le Tecnologie delle Terapie Avanzate (LTTA), University of Ferrara, 44121 Ferrara, Italy
  - Department of Life Science and Biotechnology, University of Ferrara, 44121 Ferrara, Italy
  - Correspondence: Tel.: +39-053-245-5319
|
27
|
Lai S, Pan S, Sun C, Coelho LP, Chen WH, Zhao XM. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol 2022; 23:242. [PMID: 36376928 PMCID: PMC9661791 DOI: 10.1186/s13059-022-02810-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/10/2021] [Accepted: 11/01/2022] [Indexed: 11/16/2022] Open
Abstract
Evaluating the quality of metagenomic assemblies is important for constructing reliable metagenome-assembled genomes and downstream analyses. Here, we present metaMIC ( https://github.com/ZhaoXM-Lab/metaMIC ), a machine learning-based tool for identifying and correcting misassemblies in metagenomic assemblies. Benchmarking results on both simulated and real datasets demonstrate that metaMIC outperforms existing tools when identifying misassembled contigs. Furthermore, metaMIC is able to localize the misassembly breakpoints, and the correction of misassemblies by splitting at misassembly breakpoints can improve downstream scaffolding and binning results.
Affiliation(s)
- Senying Lai
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Shaojun Pan
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Luis Pedro Coelho
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, Institutes of Brain Science, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - International Human Phenome Institutes (Shanghai), Shanghai, China
  - Zhangjiang Fudan International Innovation Center, Shanghai, China
|
28
|
Caceres M, Mumey B, Husic E, Rizzi R, Cairo M, Sahlin K, Tomescu AI. Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3673-3684. [PMID: 34847041 DOI: 10.1109/tcbb.2021.3131203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms for finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length relative to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.
|
29
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
|
30
|
Hilt EE, Ferrieri P. Next Generation and Other Sequencing Technologies in Diagnostic Microbiology and Infectious Diseases. Genes (Basel) 2022; 13:genes13091566. [PMID: 36140733 PMCID: PMC9498426 DOI: 10.3390/genes13091566] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 07/28/2022] [Revised: 08/24/2022] [Accepted: 08/26/2022] [Indexed: 12/03/2022] Open
Abstract
Next-generation sequencing (NGS) technologies have become increasingly available for use in the clinical microbiology diagnostic environment. There are three main applications of these technologies in the clinical microbiology laboratory: whole genome sequencing (WGS), targeted metagenomics sequencing and shotgun metagenomics sequencing. These applications are being utilized for initial identification of pathogenic organisms, the detection of antimicrobial resistance mechanisms and for epidemiologic tracking of organisms within and outside hospital systems. In this review, we analyze these three applications and provide a comprehensive summary of how these applications are currently being used in public health, basic research, and clinical microbiology laboratory environments. In the public health arena, WGS is being used to identify and epidemiologically track food borne outbreaks and disease surveillance. In clinical hospital systems, WGS is used to identify multi-drug-resistant nosocomial infections and track the transmission of these organisms. In addition, we examine how metagenomics sequencing approaches (targeted and shotgun) are being used to circumvent the traditional and biased microbiology culture methods to identify potential pathogens directly from specimens. We also expand on the important factors to consider when implementing these technologies, and what is possible for these technologies in infectious disease diagnosis in the next 5 years.
|
31
|
Pickett BD, Glass JR, Johnson TP, Ridge PG, Kauwe JSK. The genome of a giant (trevally): Caranx ignobilis. GigaByte 2022; 2022:gigabyte67. [PMID: 36824527 PMCID: PMC9694125 DOI: 10.46471/gigabyte.67] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2022] [Accepted: 08/25/2022] [Indexed: 11/09/2022] Open
Abstract
Caranx ignobilis, commonly known as giant kingfish or giant trevally, is a large, reef-associated apex predator. It is a prized sportfish, targeted throughout its tropical and subtropical range in the Indian and Pacific Oceans. It has also gained significant interest in aquaculture due to its unusual freshwater tolerance. Here, we present a draft assembly of the estimated 625.92 Mbp nuclear genome of a C. ignobilis individual from Hawaiian waters, which host a genetically distinct population. Our 97.4% BUSCO-complete assembly has a contig NG50 of 7.3 Mbp and a scaffold NG50 of 46.3 Mbp. Twenty-five of the 203 scaffolds contain 90% of the genome. We also present noisy long-read DNA, Hi-C, and RNA-seq datasets; the RNA-seq data cover eight distinct tissues and can help with annotation and studies of freshwater tolerance. Our genome assembly and its supporting data are valuable tools for ecological and comparative genomics studies of kingfishes and other carangoid fishes.
Affiliation(s)
- Jessica R. Glass
  - South African Institute for Aquatic Biodiversity, Makhanda, South Africa
  - College of Fisheries and Ocean Sciences, University of Alaska Fairbanks, Fairbanks, Alaska, USA
- Perry G. Ridge
  - Department of Biology, Brigham Young University, Provo, Utah, USA
- John S. K. Kauwe
  - Department of Biology, Brigham Young University, Provo, Utah, USA
  - Brigham Young University – Hawai‘i, Laie, Hawai‘i, USA
|
32
|
Guo R, Papanicolaou A, Fritz ML. Validation of reference-assisted assembly using existing and novel Heliothine genomes. Genomics 2022; 114:110441. [PMID: 35931274 DOI: 10.1016/j.ygeno.2022.110441] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/29/2021] [Revised: 07/19/2022] [Accepted: 07/29/2022] [Indexed: 11/16/2022]
Abstract
Chloridea subflexa and Chloridea virescens are a pair of closely related noctuid species exhibiting pheromone-based sexual isolation and divergent host plant preferences. We produced a novel Illumina short read C. subflexa genome assembly and an improved C. virescens genome assembly, which offer opportunities to study the genomic basis for evolutionarily important traits in this lepidopteran family with few genomic resources. We then examined the feasibility of reference-assisted assembly, an approach that leverages existing high quality genomic resources for genome improvement in closely related taxa and applied it to our Heliothine genomes. Our work demonstrates that reference-assisted assembly has the potential to enhance contiguity and completeness of existing insect genomic resources with minimal additional laboratory costs. We conclude by discussing both the potential and pitfalls of reference-assisted assembly according to the intended downstream assembly application.
Affiliation(s)
- Rong Guo
  - Department of Entomology, University of Maryland, College Park, MD 20742, USA
  - Computational Biology, Bioinformatics and Genomics Program, Department of Biological Sciences, University of Maryland, College Park, MD 20742, USA
- Alexie Papanicolaou
  - Hawkesbury Institute for the Environment, Western Sydney University, Richmond, NSW 2753, Australia
- Megan L Fritz
  - Department of Entomology, University of Maryland, College Park, MD 20742, USA
  - Computational Biology, Bioinformatics and Genomics Program, Department of Biological Sciences, University of Maryland, College Park, MD 20742, USA
|
33
|
Gupta AK, Kumar M. Benchmarking and Assessment of Eight De Novo Genome Assemblers on Viral Next-Generation Sequencing Data, Including the SARS-CoV-2. OMICS 2022; 26:372-381. [PMID: 35759429 DOI: 10.1089/omi.2022.0042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/15/2023]
Abstract
Viral genomics has become crucial in clinical diagnostics and ecology, not to mention to stem the COVID-19 pandemic. Whole-genome sequencing (WGS) is pivotal in gaining an improved understanding of viral evolution, genomic epidemiology, infectious outbreaks, pathobiology, clinical management, and vaccine development. Genome assembly is one of the crucial steps in WGS data analyses. A series of different assemblers has been developed with the advent of high-throughput next-generation sequencing (NGS). Various studies have reported the evaluation of these assembly tools on distinct datasets; however, these lack data from viral origin. In this study, we performed a comparative evaluation and benchmarking of eight de novo assemblers: SOAPdenovo, Velvet, assembly by short sequences (ABySS), iterative De Bruijn graph assembler (IDBA), SPAdes, Edena, iterative virus assembler, and VICUNA on the viral NGS data from distinct Illumina (GAIIx, Hiseq, Miseq, and Nextseq) platforms. WGS data of diverse viruses, that is, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), dengue virus 3, human immunodeficiency virus 1, hepatitis B virus, human herpesvirus 8, human papillomavirus 16, rhinovirus A, and West Nile virus, were utilized to assess these assemblers. Performance metrics such as genome fraction recovery, assembly lengths, NG50, N50, contig length, contig numbers, mismatches, and misassemblies were analyzed. Overall, three assemblers, that is, SPAdes, IDBA, and ABySS, performed consistently well, including for genome assembly of SARS-CoV-2. These assembly methods should be considered and recommended for future studies of viruses. The study also suggests that implementing two or more assembly approaches should be considered in viral NGS studies, especially in clinical settings. 
Taken together, the benchmarking of eight de novo genome assemblers reported in this study can inform future public health and ecology research concerning the viruses, the COVID-19 pandemic, and viral outbreaks.
Affiliation(s)
- Amit Kumar Gupta
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
- Manoj Kumar
  - Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR), Chandigarh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
|
34
|
Tarafder S, Islam M, Shatabda S, Rahman A. Figbird: A probabilistic method for filling gaps in genome assemblies. Bioinformatics 2022; 38:3717-3724. [PMID: 35731219 DOI: 10.1093/bioinformatics/btac404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2021] [Revised: 06/12/2022] [Accepted: 06/17/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Advances in sequencing technologies have led to the sequencing of the genomes of a multitude of organisms. However, draft genomes of many of these organisms contain a large number of gaps due to repeats in genomes, low sequencing coverage and limitations in sequencing technologies. Although several tools exist for filling gaps, many of them do not utilize all information relevant to gap filling. RESULTS Here, we present a probabilistic method for filling gaps in draft genome assemblies using second-generation reads, based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Our method is based on the expectation-maximization (EM) algorithm, unlike the graph-based methods adopted in the literature. Experiments on real biological datasets show that this novel approach can fill large portions of gaps with a small number of errors and misassemblies compared with other state-of-the-art gap-filling tools. AVAILABILITY AND IMPLEMENTATION The method is implemented in C++ in a software package named "Filling Gaps by Iterative Read Distribution (Figbird)", which is available at: https://github.com/SumitTarafder/Figbird. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sumit Tarafder
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
  - Department of Computer Science and Engineering, United International University, Dhaka, 1212, Bangladesh
- Mazharul Islam
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
  - Department of Computer Science and Engineering, United International University, Dhaka, 1212, Bangladesh
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, United International University, Dhaka, 1212, Bangladesh
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
|
35
|
Transcriptome Analysis and Identification of a Female-Specific SSR Marker in Pistacia chinensis Based on Illumina Paired-End RNA Sequencing. Genes (Basel) 2022; 13:genes13061024. [PMID: 35741786 PMCID: PMC9222763 DOI: 10.3390/genes13061024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2022] [Revised: 05/27/2022] [Accepted: 05/31/2022] [Indexed: 02/08/2023] Open
Abstract
Pistacia chinensis Bunge (P. chinensis), a dioecious plant species, is widely found in China. Female P. chinensis plants are more important than male plants in agricultural production, as their seeds can serve as an ideal feedstock for biodiesel. However, the sex of P. chinensis plants is hard to distinguish during the seedling stage due to the scarcity of available transcriptomic and genomic information. In this work, Illumina paired-end RNA sequencing was conducted to unravel the transcriptomic profiles of female and male P. chinensis flower buds. In total, 50,925,088 and 51,470,578 clean reads were obtained from the female and male cDNA libraries, respectively. After quality checks and de novo assembly, a total of 83,370 unigenes with a mean length of 1.3 kb were screened. Overall, 64,539 unigenes (77.48%) could be matched in at least one of the NR, NT, Swiss-Prot, COG, KEGG, and GO databases, 71 of which were putatively related to the floral development of P. chinensis. Additionally, 21,662 simple sequence repeat (SSR) motifs were identified in 17,028 unigenes of P. chinensis. Mononucleotide motifs were the most dominant type of repeat (52.59%), followed by dinucleotide (22.29%) and trinucleotide (20.15%) motifs. The most abundant repeats were AG/CT (13.97%), followed by AAC/GTT (6.75%) and AT/TA (6.10%). Based on these SSRs, 983 EST-SSR primers were designed, 151 of which were randomly chosen for validation. Of these validated EST-SSR markers, 25 SSR markers were found to be polymorphic between male and female plants. One SSR marker, namely PCSSR55, displayed excellent specificity in female plants and could clearly distinguish between male and female P. chinensis. Altogether, our findings not only reveal that the EST-SSR marker is extremely effective in distinguishing between male and female P. chinensis but also provide a solid framework for sex determination of plant seedlings.
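The SSR screening step summarized above can be pictured with a small regular-expression scan. This is a minimal sketch and not the pipeline used in the study: the function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` (10 for mononucleotide, 6 for dinucleotide, 5 for trinucleotide motifs) are assumed thresholds chosen for illustration, and the test sequence is invented.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {1: 10, 2: 6, 3: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats."""
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # A motif of unit_len bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if unit_len > 1 and len(set(unit)) == 1:
                continue  # e.g. "AA" copies are really a mononucleotide run
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)


seq = "TTC" + "AG" * 7 + "CCT" + "A" * 12 + "GT" + "AAC" * 5 + "T"
ssrs = find_ssrs(seq)
```

A motif and its reverse complement (AG and CT, for example) describe the same locus on opposite strands, which is why the abstract reports them as combined classes such as AG/CT.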
|
36
|
Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022; 19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
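The quality values quoted above are easier to interpret as error rates. The sketch below assumes the usual Phred scaling, QV = -10 log10(p), where p is the estimated per-base error probability; the function names are illustrative, and how the k-mer based estimate of p is obtained is outside the scope of this example.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


# Average spacing between errors, in bases, implied by each reported QV.
bases_per_error_before = 1 / qv_to_error_rate(70.2)
bases_per_error_after = 1 / qv_to_error_rate(73.9)
```

Under this scaling, QV 70.2 corresponds to roughly one error per 10 Mb and QV 73.9 to roughly one error per 25 Mb.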
Affiliation(s)
- Ann M Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
| | - Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA
| | - Giulio Formenti
- Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
| | - Chirag Jain
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Daniela C Soto
- Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
| | - Ivan Sović
- Pacific Biosciences, Menlo Park, CA, USA
- Digital BioLogic d.o.o., Ivanić-Grad, Croatia
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
|
37
|
Li M, Li LM. RegScaf: a regression approach to scaffolding. Bioinformatics 2022; 38:2675-2682. [PMID: 35561180 PMCID: PMC9326850 DOI: 10.1093/bioinformatics/btac174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 02/19/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Crucial to the correctness of a genome assembly is the accuracy of the underlying scaffolds that specify the orders and orientations of contigs together with the gap distances between contigs. The current methods construct scaffolds based on the alignments of 'linking' reads against contigs. We found that some 'optimal' alignments are mistaken due to factors such as the contig boundary effect, particularly in the presence of repeats. Occasionally, the incorrect alignments can even overwhelm the correct ones. Detecting the incorrect linking information is challenging for existing methods. RESULTS In this study, we present a novel scaffolding method, RegScaf. It first uses kernel density estimation to examine the distribution of distances between contigs derived from read alignments. When a density shows multiple modes, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode. The linear model parameterizes contigs by their positions on the genome; each linking distance between a pair of contigs is then taken as an observation of the difference of their positions. The parameters are estimated by minimizing a global loss function, which is a version of the trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances. The results on both synthetic and real datasets demonstrate that RegScaf outperforms some popular scaffolders, especially in the accuracy of gap estimates, by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well. AVAILABILITY AND IMPLEMENTATION RegScaf is publicly available at https://github.com/lemontealala/RegScaf.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
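The regression idea in this abstract can be sketched on a toy problem. The code below is not the RegScaf implementation: it omits the kernel density clustering and orientation handling, and it uses a simple concentration-step heuristic (refit, keep the links with the smallest squared residuals, repeat) as one assumed way to approximate a least trimmed squares fit. All function names and the example links are hypothetical.

```python
def solve_linear(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]

def fit_positions(n_contigs, links):
    """Ordinary least squares for contig positions, with contig 0 pinned at 0.

    Each link (i, j, d) is one observation of x[j] - x[i] ~= d. The normal
    equations are singular if some contig has no retained link.
    """
    n = n_contigs - 1
    N = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for i, j, d in links:
        coeffs = [(k - 1, s) for k, s in ((i, -1.0), (j, 1.0)) if k > 0]
        for a, sa in coeffs:
            rhs[a] += sa * d
            for b, sb in coeffs:
                N[a][b] += sa * sb
    return [0.0] + solve_linear(N, rhs)

def lts_positions(n_contigs, links, n_keep, max_iter=50):
    """Approximate least trimmed squares: refit on the n_keep best-fitting links."""
    kept = list(range(len(links)))          # start from the fit on all links
    for _ in range(max_iter):
        x = fit_positions(n_contigs, [links[k] for k in kept])
        sq = [(x[j] - x[i] - d) ** 2 for i, j, d in links]
        new_kept = sorted(sorted(range(len(links)), key=sq.__getitem__)[:n_keep])
        if new_kept == kept:
            break
        kept = new_kept
    return x, kept

# Toy data: true positions 0, 1000, 2500; the last link is a gross outlier.
links = [(0, 1, 1000), (0, 1, 1010), (0, 1, 990),
         (1, 2, 1500), (1, 2, 1490), (1, 2, 1510),
         (0, 2, 2500), (0, 2, 2505), (0, 2, 2495),
         (0, 1, 5000)]
positions, kept = lts_positions(3, links, n_keep=9)
```

In this example the plain least squares fit is pulled far from the true positions by the single bad link, while the trimmed fit discards it. A heuristic of this kind can stop at a local optimum, so practical robust estimators usually try several starting subsets.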
Affiliation(s)
- Mengtian Li
- National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Lei M Li
- To whom correspondence should be addressed.
|
38
|
Roos FJM, van Tienderen GS, Wu H, Bordeu I, Vinke D, Albarinos LM, Monfils K, Niesten S, Smits R, Willemse J, Rosmark O, Westergren-Thorsson G, Kunz DJ, de Wit M, French PJ, Vallier L, IJzermans JNM, Bartfai R, Marks H, Simons BD, van Royen ME, Verstegen MMA, van der Laan LJW. Human branching cholangiocyte organoids recapitulate functional bile duct formation. Cell Stem Cell 2022; 29:776-794.e13. [PMID: 35523140 DOI: 10.1016/j.stem.2022.04.011] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 10/21/2021] [Revised: 02/25/2022] [Accepted: 04/14/2022] [Indexed: 12/13/2022]
Abstract
Human cholangiocyte organoids show great promise for regenerative therapies and in vitro modeling of bile duct development and diseases. However, the cystic organoids lack the branching morphology of intrahepatic bile ducts (IHBDs). Here, we report establishing human branching cholangiocyte organoid (BRCO) cultures. BRCOs self-organize into complex tubular structures resembling the IHBD architecture. Single-cell transcriptomics and functional analysis showed high similarity to primary cholangiocytes, and importantly, the branching growth mimics aspects of tubular development and is dependent on JAG1/NOTCH2 signaling. When applied to cholangiocarcinoma tumor organoids, the morphology changes to an in vitro morphology like primary tumors. Moreover, these branching cholangiocarcinoma organoids (BRCCAOs) better match the transcriptomic profile of primary tumors and showed increased chemoresistance to gemcitabine and cisplatin. In conclusion, BRCOs recapitulate a complex process of branching morphogenesis in vitro. This provides an improved model to study tubular formation, bile duct functionality, and associated biliary diseases.
Affiliation(s)
- Floris J M Roos
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands; Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK; Department of Surgery, University of Cambridge and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
| | - Gilles S van Tienderen
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Haoyu Wu
- Radboud University, Department of Molecular Biology, Nijmegen, the Netherlands
| | - Ignacio Bordeu
- Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, UK; Department of Applied Mathematics and Theoretical Physics, Centre for Mathematical Sciences, University of Cambridge, Cambridge, UK
| | - Dina Vinke
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Laura Muñoz Albarinos
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Kathryn Monfils
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Sabrah Niesten
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Ron Smits
- Erasmus MC, University Medical Center Rotterdam, Department of Gastroenterology and Hepatology, Rotterdam, the Netherlands
| | - Jorke Willemse
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Oskar Rosmark
- Lung Biology, Department Experimental Medical Science, Lund University, Lund, Sweden
| | - Daniel J Kunz
- Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, UK; Department of Applied Mathematics and Theoretical Physics, Centre for Mathematical Sciences, University of Cambridge, Cambridge, UK; Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, University of Cambridge, Cambridge, UK
| | - Maurice de Wit
- Erasmus MC, University Medical Center Rotterdam, Department of Pathology, Rotterdam, the Netherlands
| | - Pim J French
- Erasmus MC, University Medical Center Rotterdam, Cancer Treatment Screening Facility, Department of Neurology, Rotterdam, the Netherlands
| | - Ludovic Vallier
- Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK; Department of Surgery, University of Cambridge and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
| | - Jan N M IJzermans
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Richard Bartfai
- Radboud University, Department of Molecular Biology, Nijmegen, the Netherlands
| | - Hendrik Marks
- Radboud University, Department of Molecular Biology, Nijmegen, the Netherlands
| | - Ben D Simons
- Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, UK; Department of Applied Mathematics and Theoretical Physics, Centre for Mathematical Sciences, University of Cambridge, Cambridge, UK
| | - Martin E van Royen
- Erasmus MC, University Medical Center Rotterdam, Department of Pathology, Rotterdam, the Netherlands
| | - Monique M A Verstegen
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands
| | - Luc J W van der Laan
- Erasmus MC Transplant Institute, University Medical Center Rotterdam, Department of Surgery, Rotterdam, the Netherlands.
|
39
|
Abstract
The availability of public genomics data has become essential for modern life sciences research, yet the quality, traceability, and curation of these data have significant impacts on a broad range of microbial genomics research. While microbial genome databases such as NCBI’s RefSeq database leverage the scalability of crowd sourcing for growth, genomics data provenance and authenticity of the source materials used to produce data are not strict requirements. Here, we describe the de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full genomics data provenance relating to bioinformatics methods, quality control, and passage history. Comparative genomics analysis of ATCC standard reference genomes (ASRGs) revealed significant issues with regard to NCBI’s RefSeq bacterial genome assemblies related to completeness, mutations, structure, strain metadata, and gaps in traceability to the original biological source materials. Nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. Deep curation of these records is not within the scope of NCBI’s core mission in supporting open science, which aims to collect sequence records that are submitted by the public. Nonetheless, we propose that gaps in metadata accuracy and data provenance represent an “elephant in the room” for microbial genomics research. Effectively addressing these issues will require raising the level of accountability for data depositors and acknowledging the need for higher expectations of quality among the researchers whose research depends on accurate and attributable reference genome data. IMPORTANCE The traceability of microbial genomics data to authenticated physical biological materials is not a requirement for depositing these data into public genome databases. 
This creates significant risks for the reliability and data provenance of these important genomics research resources, the impact of which is not well understood. We sought to investigate this by carrying out a comparative genomics study of 1,113 ATCC standard reference genomes (ASRGs) produced by ATCC from authenticated and traceable materials using the latest sequencing technologies. We found widespread discrepancies in genome assembly quality, genetic variability, and the quality and completeness of the associated metadata among hundreds of reference genomes for ATCC strains found in NCBI’s RefSeq database. We present a comparative analysis of de novo-assembled ASRGs, their respective metadata, and variant analysis using RefSeq genomes as a reference. Although assembly quality in RefSeq has generally improved over time, we found that significant quality issues remain, especially as related to genomic data and metadata provenance. Our work highlights the importance of data authentication and provenance for the microbial genomics community, and underscores the risks of ignoring this issue in the future.
|
40
|
Durand BARN, Yahiaoui Martinez A, Baud D, François P, Lavigne JP, Dunyach-Remy C. Comparative genomics analysis of two Helcococcus kunzii strains co-isolated with Staphylococcus aureus from diabetic foot ulcers. Genomics 2022; 114:110365. [PMID: 35413435 DOI: 10.1016/j.ygeno.2022.110365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/15/2021] [Revised: 03/01/2022] [Accepted: 04/06/2022] [Indexed: 01/14/2023]
Abstract
Helcococcus kunzii is a commensal Gram-positive bacterial species recovered from the human skin microbiota and considered an opportunistic pathogen. Although little is known about its clinical significance, its increased abundance has been reported in infected wounds, particularly in foot ulcers in persons with diabetes. This species is usually detected in mixed cultures from human specimens and frequently isolated with Staphylococcus aureus. Modulation of staphylococci virulence by H. kunzii has been shown in an infection model of Caenorhabditis elegans. The aim of this study was to compare the genomes of two H. kunzii strains isolated from foot ulcers (isolates H13 and H10, showing high and low impact on S. aureus virulence, respectively) and the H. kunzii ATCC51366 strain. Whole genome analyses revealed some differences between the two strains: length (2.06 Mb (H13) and 2.05 Mb (H10)), GC content (29.3% (H13) and 29.5% (H10)) and gene content (1,884 (H13) and 1,786 (H10) predicted genes). The core-proteome phylogenies within the genus characterised H. kunzii H13 and H10 as genetically similar to their ancestor. The main differences between the strains were in sugar-associated transporters and various hypothetical proteins. Five targets were identified as potentially involved in S. aureus virulence modulation in both genomes: the two-component iron export system and three autoinducer-like proteins. Moreover, the H13 strain harbours a prophage inserted at positions 1,261,110-1,295,549 (attL-attR), which is absent in the H10 strain. The prophage PhiCD38_2 was previously reported for its ability to modulate the secretion profile, reinforcing the autoinducer-like hypothesis. In the future, transcriptomics or metaproteomics approaches could be performed to better characterize the H13 strain and possibly identify the underlying mechanism for S. aureus virulence modulation.
Affiliation(s)
- Benjamin A R N Durand
- Bacterial Virulence and Chronic Infections, INSERM U1047, University of Montpellier, Department of Microbiology and Hospital Hygiene, University Hospital Nîmes, 30908 Nîmes, France
| | - Alex Yahiaoui Martinez
- Department of Microbiology and Hospital Hygiene, University Hospital Nîmes, University of Montpellier, 30029 Nîmes, France
| | - Damien Baud
- Department of Infectious Diseases, Genomic Research Laboratory, Geneva University Hospitals, 1205 Geneva, Switzerland
| | - Patrice François
- Department of Infectious Diseases, Genomic Research Laboratory, Geneva University Hospitals, 1205 Geneva, Switzerland
| | - Jean-Philippe Lavigne
- Bacterial Virulence and Chronic Infections, INSERM U1047, University of Montpellier, Department of Microbiology and Hospital Hygiene, University Hospital Nîmes, 30908 Nîmes, France.
| | - Catherine Dunyach-Remy
- Bacterial Virulence and Chronic Infections, INSERM U1047, University of Montpellier, Department of Microbiology and Hospital Hygiene, University Hospital Nîmes, 30908 Nîmes, France
|
41
|
Palma F, Mangone I, Janowicz A, Moura A, Chiaverini A, Torresi M, Garofolo G, Criscuolo A, Brisse S, Di Pasquale A, Cammà C, Radomski N. In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes. BMC Genomics 2022; 23:235. [PMID: 35346021 PMCID: PMC8961897 DOI: 10.1186/s12864-022-08437-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/06/2021] [Accepted: 02/28/2022] [Indexed: 02/02/2023] Open
Abstract
Background Whole genome sequencing analyzed by core genome multi-locus sequence typing (cgMLST) is widely used in surveillance of the pathogenic bacterium Listeria monocytogenes. Given the heterogeneity of available bioinformatics tools to define cgMLST alleles, our aim was to identify parameters influencing the precision of cgMLST profiles. Methods We used three L. monocytogenes reference genomes from different phylogenetic lineages and assessed the impact of in vitro (i.e. tested genomes, successive platings, replicates of DNA extraction and sequencing) and in silico parameters (i.e. targeted depth of coverage, depth of coverage, breadth of coverage, assembly metrics, cgMLST workflows, cgMLST completeness) on the precision of cgMLST profiles made of 1748 core loci. Six cgMLST workflows were tested, comprising assembly-based (BIGSdb, INNUENDO, GENPAT, SeqSphere and BioNumerics) and assembly-free (i.e. kmer-based MentaLiST) allele callers. Principal component analyses and generalized linear models were used to identify the parameters with the greatest impact on cgMLST precision. Results The isolate’s genetic background, cgMLST workflows, cgMLST completeness, as well as depth and breadth of coverage were the parameters that most affected cgMLST precision (i.e. identical alleles against reference circular genomes). All workflows performed well at ≥40X depth of coverage, with high loci detection (> 99.54% for all, except for BioNumerics with 97.78%), and showed consistent cluster definitions using the reference cut-off of ≤7 allele differences. Conclusions This highlights that bioinformatics workflows dedicated to cgMLST allele calling are largely robust when paired-end reads are of high quality and when the sequencing depth is ≥40X. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08437-4.
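The cluster definition mentioned in this abstract (a cut-off of ≤7 allele differences) can be sketched as follows. The code counts differing loci between two allele profiles, ignoring loci with a missing call, and then groups isolates by single linkage. Single linkage, the handling of missing calls, the function names, and the 20-locus toy profiles are all assumptions for illustration; the abstract does not specify how the evaluated workflows cluster, and the real scheme has 1748 loci.

```python
from itertools import combinations

def allele_differences(p, q):
    """Count loci where both profiles have a called allele and the calls differ."""
    return sum(1 for a, b in zip(p, q)
               if a is not None and b is not None and a != b)

def single_linkage_clusters(profiles, cutoff=7):
    """Group isolates whose profiles are linked by chains of <= cutoff differences."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if allele_differences(profiles[a], profiles[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())

# Toy profiles over 20 loci; None marks a locus that was not called.
profiles = {
    "iso1": [1] * 20,
    "iso2": [1] * 17 + [2] * 3,            # 3 differences from iso1
    "iso3": [1] * 10 + [3] * 10,           # 10 differences from iso1
    "iso4": [1] * 10 + [3] * 9 + [None],   # identical to iso3 where called
}
clusters = single_linkage_clusters(profiles)
```

Lower cgMLST completeness means more `None` entries, which shrinks the number of comparable loci and is one way missing calls can influence cluster assignment.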
|
42
|
Grealey J, Lannelongue L, Saw WY, Marten J, Méric G, Ruiz-Carmona S, Inouye M. The Carbon Footprint of Bioinformatics. Mol Biol Evol 2022; 39:6526403. [PMID: 35143670 PMCID: PMC8892942 DOI: 10.1093/molbev/msac034] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 11/14/2022] Open
Abstract
Bioinformatic research relies on large-scale computational infrastructures which have a nonzero carbon footprint but so far, no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this work, we estimate the carbon footprint of bioinformatics (in kilograms of CO2 equivalent units, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org, last accessed 2022). We assessed 1) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics, and molecular simulations, as well as 2) computation strategies, such as parallelization, CPU (central processing unit) versus GPU (graphics processing unit), cloud versus local computing infrastructure, and geography. In particular, we found that biobank-scale GWAS emitted substantial kgCO2e and simple software upgrades could make it greener, for example, upgrading from BOLT-LMM v1 to v2.3 reduced carbon footprint by 73%. Moreover, switching from the average data center to a more efficient one can reduce carbon footprint by approximately 34%. Memory over-allocation can also be a substantial contributor to an algorithm’s greenhouse gas emissions. The use of faster processors or greater parallelization reduces running time but can lead to greater carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimize kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.
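The kind of estimate discussed in this abstract can be sketched with a simple energy model: power drawn by cores and memory, scaled by runtime and data-center overhead (PUE), then multiplied by the carbon intensity of the electricity. The function below is an assumed simplification written for illustration, not the Green Algorithms calculator itself, and every numeric value in the example is a hypothetical placeholder rather than a figure from the paper.

```python
def carbon_footprint_kg(runtime_h, n_cores, watts_per_core, core_usage,
                        memory_gb, watts_per_gb, pue,
                        carbon_intensity_g_per_kwh):
    """Estimate kgCO2e for one job under a simple power model."""
    # Power drawn by the allocated cores (scaled by usage) and allocated memory.
    power_w = n_cores * watts_per_core * core_usage + memory_gb * watts_per_gb
    # Energy in kWh, including data-center overhead via the PUE factor.
    energy_kwh = runtime_h * power_w * pue / 1000.0
    # Grams of CO2e converted to kilograms.
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0

# Hypothetical job: 10 h on 16 cores; only the memory allocation differs.
base = dict(runtime_h=10, n_cores=16, watts_per_core=10.0, core_usage=1.0,
            watts_per_gb=0.4, pue=1.5, carbon_intensity_g_per_kwh=500.0)
right_sized = carbon_footprint_kg(memory_gb=64, **base)
over_allocated = carbon_footprint_kg(memory_gb=256, **base)
```

Under these placeholder values, quadrupling the memory allocation raises the estimate from about 1.39 to about 1.97 kgCO2e, which illustrates why memory over-allocation is flagged as a contributor. Lowering `pue` or `carbon_intensity_g_per_kwh` models a more efficient data center or a cleaner grid.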
Affiliation(s)
- Jason Grealey
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia
| | - Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
| | - Woei-Yuh Saw
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
| | - Jonathan Marten
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
| | - Sergio Ruiz-Carmona
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK; British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK; The Alan Turing Institute, London, UK
|
43
|
In-Depth Analysis of Bacillus anthracis 16S rRNA Genes and Transcripts Reveals Intra- and Intergenomic Diversity and Facilitates Anthrax Detection. mSystems 2022; 7:e0136121. [PMID: 35076271 PMCID: PMC8788319 DOI: 10.1128/msystems.01361-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
Analysis of 16S rRNA genes provides a central means of taxonomic classification of bacterial species. Based on presumed sequence identity among species of the Bacillus cereus sensu lato group, the 16S rRNA genes of B. anthracis have been considered unsuitable for diagnosis of the anthrax pathogen. With the recent identification of a single nucleotide polymorphism in some 16S rRNA gene copies, specific identification of B. anthracis becomes feasible. Here, we designed and evaluated a set of in situ, in vitro, and in silico assays to assess the unknown 16S state of B. anthracis from different perspectives. Using a combination of digital PCR, fluorescence in situ hybridization, long-read genome sequencing, and bioinformatics, we were able to detect and quantify a unique 16S rRNA gene allele of B. anthracis (16S-BA-allele). This allele was found in all available B. anthracis genomes and may facilitate differentiation of the pathogen from any close relative. Bioinformatics analysis of 959 B. anthracis SRA data sets inferred that abundances and genomic arrangements of the 16S-BA-allele and the entire rRNA operon copy numbers differ considerably between strains. Expression ratios of 16S-BA-alleles were proportional to the respective genomic allele copy numbers. The findings and experimental tools presented here provide detailed insights into the intra- and intergenomic diversity of 16S rRNA genes and may pave the way for improved identification of B. anthracis and other pathogens with diverse rRNA operons. IMPORTANCE For severe infectious diseases, precise pathogen detection is crucial for antibiotic therapy and patient survival. Identification of Bacillus anthracis, the causative agent of the zoonosis anthrax, can be challenging when querying specific nucleotide sequences such as in small subunit rRNA (16S rRNA) genes, which are commonly used for typing of bacteria.
This study analyzed on a broad genomic scale a cryptic and hitherto underappreciated allelic variant of the bacterium’s 16S rRNA genes and their transcripts using a set of in situ, in vitro, and in silico assays and found significant intra- and intergenomic heterogeneity in the distribution of the allele and overall rRNA operon copy numbers. This allelic variation was uniquely species specific, which enabled sensitive pathogen detection on both DNA and transcript levels. The methodology used here is likely also applicable to other pathogens that are otherwise difficult to discriminate from their less harmful relatives.
|
44
|
Li B, Zhang X, Liu Z, Wang L, Song L, Liang X, Dou S, Tu J, Shen J, Yi B, Wen J, Fu T, Dai C, Gao C, Wang A, Ma C. Genetic and Molecular Characterization of a Self-Compatible Brassica rapa Line Possessing a New Class II S Haplotype. Plants (Basel) 2021; 10:plants10122815. [PMID: 34961286 PMCID: PMC8709392 DOI: 10.3390/plants10122815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/04/2021] [Revised: 12/01/2021] [Accepted: 12/03/2021] [Indexed: 05/20/2023]
Abstract
Most flowering plants have evolved a self-incompatibility (SI) system to maintain genetic diversity by preventing self-pollination. Brassica species possess sporophytic self-incompatibility (SSI), which is controlled by the pollen- and stigma-determinant factors SP11/SCR and SRK. However, the molecular mechanism of SI remains largely unknown. Here, a new class II S haplotype, named BrS-325, was identified in a pak choi line '325' and was responsible for the completely self-compatible phenotype. To obtain the entire S locus sequence, a complete pak choi genome was obtained through Nanopore sequencing and de novo assembly, which provided a good reference genome for breeding and molecular research in B. rapa. S locus comparative analysis showed that the closest relative of BrS-325 was BrS-60, and that high sequence polymorphism existed in the S locus. Meanwhile, two duplicated SRKs (BrSRK-325a and BrSRK-325b) were distributed in the BrS-325 locus with opposite transcription directions. BrSRK-325b and BrSCR-325 were expressed normally at the transcriptional level. The multiple sequence alignment of SCRs and SRKs in class II S haplotypes showed that a number of amino acid variations were present in the contact regions (CR II and CR III) of BrSCR-325 and the hypervariable regions (HV I and HV II) of BrSRK-325s, which may influence the binding and interaction between the ligand and the receptor. Thus, these results suggest that amino acid variations in contact sites may lead to the breakdown of SI in the new class II S haplotype BrS-325 in B. rapa. The complete SC phenotype of '325' shows potential for practical breeding applications in B. rapa.
Affiliation(s)
- Bing Li
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Xueli Zhang
- Wuhan Vegetable Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan 430345, China; (X.Z.); (L.S.)
| | - Zhiquan Liu
- Hunan Vegetable Research Institute, Hunan Academy of Agricultural Science, Changsha 410125, China
| | - Lulin Wang
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Liping Song
- Wuhan Vegetable Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan 430345, China; (X.Z.); (L.S.)
| | - Xiaomei Liang
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Shengwei Dou
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Jinxing Tu
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Jinxiong Shen
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Bin Yi
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Jing Wen
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Tingdong Fu
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Cheng Dai
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
| | - Changbin Gao
- Wuhan Vegetable Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan 430345, China; (X.Z.); (L.S.)
- Correspondence: (C.G.); (A.W.); (C.M.); Tel.: +86-27-8728-18-07 (C.M.)
| | - Aihua Wang
- Wuhan Vegetable Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan 430345, China; (X.Z.); (L.S.)
- Correspondence: (C.G.); (A.W.); (C.M.); Tel.: +86-27-8728-18-07 (C.M.)
| | - Chaozhi Ma
- National Sub-Center of Rapeseed Improvement in Wuhan, National Key Laboratory of Crop Genetic Improvement, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; (B.L.); (L.W.); (X.L.); (S.D.); (J.T.); (J.S.); (B.Y.); (J.W.); (T.F.); (C.D.)
- Correspondence: (C.G.); (A.W.); (C.M.); Tel.: +86-27-8728-18-07 (C.M.)
| |
|
45
|
Wagner DD, Carleton HA, Trees E, Katz LS. Evaluating whole-genome sequencing quality metrics for enteric pathogen outbreaks. PeerJ 2021; 9:e12446. [PMID: 34900416 PMCID: PMC8627651 DOI: 10.7717/peerj.12446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2021] [Accepted: 10/18/2021] [Indexed: 11/25/2022]
Abstract
Background: Whole genome sequencing (WGS) has gained increasing importance in responses to enteric bacterial outbreaks. Common analysis procedures for WGS, namely single nucleotide polymorphism (SNP) analysis and genome assembly, are highly dependent upon WGS data quality. Methods: Raw, unprocessed WGS reads from Escherichia coli, Salmonella enterica, and Shigella sonnei outbreak clusters were characterized for four quality metrics: PHRED score, read length, library insert size, and ambiguous nucleotide composition. PHRED scores were strongly correlated with improved SNP analysis results in E. coli and S. enterica clusters. Results: Assembly quality showed only moderate correlations with PHRED scores and library insert size, and then only for Salmonella. To improve SNP analyses and assemblies, we compared seven read-healing pipelines to improve these four quality metrics and to see how well they improved SNP analysis and genome assembly. The most effective read-healing pipelines for SNP analysis incorporated quality-based trimming, fixed-width trimming, or both. The Lyve-SET SNP pipeline showed a more marked improvement than the CFSAN SNP Pipeline, but the latter performed better on raw, unhealed reads. For genome assembly, SPAdes enabled significant improvements in healed E. coli reads only, while Skesa yielded no significant improvements on healed reads. Conclusions: PHRED scores will continue to be a crucial quality metric, albeit not of equal impact across all types of analyses for all enteric bacteria. While trimming-based read healing performed well for SNP analyses, different read-healing approaches are likely needed for genome assembly or other, emerging WGS analysis methodologies.
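As a rough illustration of the PHRED metric and the quality-based trimming mentioned in this abstract, the sketch below decodes a FASTQ quality string and trims low-quality bases from the 3' end. It assumes the common Phred+33 ASCII encoding and a threshold of Q20; the function names and defaults are hypothetical and are not taken from any of the read-healing pipelines the authors compared.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into per-base PHRED scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Convert a PHRED score Q into a base-call error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, quality: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop trailing bases whose PHRED score is below min_q (illustrative threshold)."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


if __name__ == "__main__":
    read, qual = "ACGT", "II##"
    print(phred_scores(qual))
    print(trim_3prime(read, qual))
```

Real trimmers typically use sliding windows or running-sum criteria rather than this simple end scan, so this should be read only as a picture of the idea.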
Affiliation(s)
- Darlene D Wagner
- Division of Viral Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.,Eagle Medical Services, LLC, Atlanta, GA, United States of America
| | - Heather A Carleton
- Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
| | - Eija Trees
- Association of Public Health Laboratories, Silver Spring, MD, United States of America
| | - Lee S Katz
- Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America.,Center for Food Safety, University of Georgia, Griffin, GA, United States of America
| |
|
46
|
MacDonald ML, Lee KH. EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality. BMC Bioinformatics 2021; 22:570. [PMID: 34837948 PMCID: PMC8627028 DOI: 10.1186/s12859-021-04480-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2020] [Accepted: 11/15/2021] [Indexed: 11/16/2022]
Abstract
Background: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment. Results: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study. Conclusions: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04480-2.
Affiliation(s)
- Madolyn L MacDonald
- Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, USA.,Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, 19716, USA.,Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA
| | - Kelvin H Lee
- Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19711, USA.,Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, 19716, USA
| |
|
47
|
Rahman A, Pachter L. SWALO: scaffolding with assembly likelihood optimization. Nucleic Acids Res 2021; 49:e117. [PMID: 34417615 PMCID: PMC8599790 DOI: 10.1093/nar/gkab717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/22/2020] [Revised: 06/16/2021] [Accepted: 08/16/2021] [Indexed: 01/01/2023]
Abstract
Scaffolding, i.e. ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more correct joins than, or a similar number to, other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.
Affiliation(s)
- Atif Rahman
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA.,Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
| | - Lior Pachter
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA.,Departments of Mathematics and Molecular & Cell Biology, University of California, Berkeley, CA 94720, USA.,Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91103, USA
| |
|
48
|
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. [PMID: 34667119 PMCID: PMC8559714 DOI: 10.1101/gr.275648.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/13/2021] [Accepted: 08/20/2021] [Indexed: 01/08/2023]
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
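The two ideas in this abstract, that one substitution alters k consecutive k-mers and that a strobemer links short k-mers chosen by a hash function, can be pictured with the minimal sketch below. It builds order-2 strobemers in a simplified way: the second strobe is the k-mer in a downstream window that minimizes a combined hash. The hash (CRC32), the modulus `PRIME`, the window handling at the sequence end, and all function names are assumptions made for illustration; this is not the StrobeMap implementation.

```python
import zlib

PRIME = 1_000_003  # hypothetical modulus for combining strobe hashes


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_kmers(a: str, b: str, k: int) -> int:
    """Count positions where the k-mers of two equal-length sequences differ."""
    return sum(1 for x, y in zip(kmers(a, k), kmers(b, k)) if x != y)


def _h(s: str) -> int:
    # Deterministic stand-in hash; Python's built-in hash() is salted per process.
    return zlib.crc32(s.encode("ascii"))


def randstrobes2(seq: str, k: int, w_min: int, w_max: int) -> list[tuple[int, int, str]]:
    """Order-2 strobemers as (first_pos, second_pos, strobe1 + strobe2).

    Assumes w_min >= k so the two strobes do not overlap. Positions near the
    end of the sequence that lack a full downstream window are skipped.
    """
    out = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        lo, hi = i + w_min, min(i + w_max, last_start)
        if lo > hi:
            break
        first = seq[i:i + k]
        # Pick the second strobe that minimizes the combined hash value.
        j = min(range(lo, hi + 1),
                key=lambda p: (_h(first) + _h(seq[p:p + k])) % PRIME)
        out.append((i, j, first + seq[j:j + k]))
    return out


if __name__ == "__main__":
    ref = "ACGTACGTACGTACGTACGT"
    mutated = ref[:10] + "T" + ref[11:]      # one substitution
    print(changed_kmers(ref, mutated, 3))    # k-mers touched by the mutation
    print(randstrobes2(ref, 3, 3, 6)[:3])
```

Because the second strobe's offset varies with the hash, an indel between the two strobes does not necessarily destroy the match, which is the intuition the abstract gives for strobemers handling indels better than contiguous or spaced k-mers.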
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
| |
|
49
|
MOSGA 2: Comparative genomics and validation tools. Comput Struct Biotechnol J 2021; 19:5504-5509. [PMID: 34712396 PMCID: PMC8517542 DOI: 10.1016/j.csbj.2021.09.024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/10/2021] [Revised: 09/23/2021] [Accepted: 09/24/2021] [Indexed: 01/06/2023]
Abstract
Due to the rapidly growing amount of available genomic information, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotations, we created MOSGA. In this work, we present MOSGA 2, which adds several advanced analyses for genomic data. Since the genomic data quality greatly impacts the annotation quality, we included multiple tools to validate and ensure high-quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 through different use cases and practical examples. MOSGA 2 extends the already established application to the quality control of the genomic data and integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.
|
50
|
Wang J, Chen K, Ren Q, Zhang Y, Liu J, Wang G, Liu A, Li Y, Liu G, Luo J, Miao W, Xiong J, Yin H, Guan G. Systematic Comparison of the Performances of De Novo Genome Assemblers for Oxford Nanopore Technology Reads From Piroplasm. Front Cell Infect Microbiol 2021; 11:696669. [PMID: 34485177 PMCID: PMC8415751 DOI: 10.3389/fcimb.2021.696669] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/17/2021] [Accepted: 06/29/2021] [Indexed: 01/06/2023]
Abstract
Background: Emerging long-read sequencing technology has greatly changed the landscape of whole-genome sequencing, enabling scientists to contribute to decoding the genetic information of non-model species. The sequences generated by PacBio or Oxford Nanopore Technology (ONT) must be assembled de novo before further analyses. Several de novo genome assemblers have been developed to assemble long reads generated by ONT. However, genome assembly is still a challenging task, and the performance of these assemblers has not been completely investigated. Methods and Results: We systematically evaluated the performance of nine de novo assemblers for ONT on datasets of different coverage depths. Several metrics were measured to determine the performance of these tools, including N50 length, sequence coverage, runtime, ease of operation, accuracy of genome and genomic completeness in varying depths of coverage. Based on the results of our assessments, the performances of these tools are summarized as follows: 1) Coverage depth has a significant effect on genome quality; 2) The level of contiguity of the assembled genome varies dramatically among different de novo tools; 3) The correctness of an assembled genome is closely related to the completeness of the genome. More than 30× nanopore data can be assembled into a relatively complete genome, the quality of which is highly dependent on the polishing using next generation sequencing data. Conclusion: Considering the results of our investigation, the advantages and disadvantages of each tool are summarized and guidelines for selecting assembly tools are provided under specific conditions.
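Two of the quantities named in this abstract, N50 length and coverage depth, have simple standard definitions that the sketch below spells out. N50 is taken here as the length of the shortest contig in the smallest set of longest contigs that together span at least half of the assembly, and depth as total sequenced bases divided by genome size. The function names are hypothetical, and the numbers in the usage lines are made up for illustration rather than drawn from the study.

```python
def n50(contig_lengths: list[int]) -> int:
    """N50 of an assembly: walk contigs from longest to shortest until
    the running total reaches at least half of the assembly length."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def coverage_depth(read_lengths: list[int], genome_size: int) -> float:
    """Average depth of coverage: total read bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


if __name__ == "__main__":
    # Illustrative values only.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    print(coverage_depth([1000] * 300, 10_000))
```

A higher N50 indicates a more contiguous assembly but says nothing about correctness, which is why the abstract lists accuracy and completeness as separate metrics.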
Affiliation(s)
- Jinming Wang
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Kai Chen
- Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Qiaoyun Ren
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Ying Zhang
- Key Laboratory of Functional Genomics and Molecular Diagnosis, Lanzhou Baiyuan Gene Technology Co., Ltd, Lanzhou, China
| | - Junlong Liu
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Guangying Wang
- Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Aihong Liu
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Youquan Li
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Guangyuan Liu
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Jianxun Luo
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| | - Wei Miao
- Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Jie Xiong
- Key Laboratory of Aquatic Biodiversity and Conservation, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
| | - Hong Yin
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China.,Jiangsu Co-Innovation Center for the Prevention and Control of Important Animal Infectious Disease and Zoonoses, Yangzhou University, Yangzhou, China
| | - Guiquan Guan
- State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Science, Lanzhou, China
| |
|