1
Boulton W, Fidan FR, Denise H, De Maio N, Goldman N. SWAMPy: simulating SARS-CoV-2 wastewater amplicon metagenomes. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae532. [PMID: 39226177 PMCID: PMC11401744 DOI: 10.1093/bioinformatics/btae532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 06/19/2024] [Accepted: 08/31/2024] [Indexed: 09/05/2024]
Abstract
MOTIVATION Tracking SARS-CoV-2 variants through genomic sequencing has been an important part of the global response to the pandemic and remains a useful tool for surveillance of the virus. As well as whole-genome sequencing of clinical samples, this surveillance effort has been aided by amplicon sequencing of wastewater samples, which proved effective in real case studies. Because of its relevance to public healthcare decisions, testing and benchmarking wastewater sequencing analysis methods is also crucial, which necessitates a simulator. Although metagenomic simulators exist, none is fit for the purpose of simulating the metagenomes produced through amplicon sequencing of wastewater. RESULTS Our new simulation tool, SWAMPy (Simulating SARS-CoV-2 Wastewater Amplicon Metagenomes with Python), is intended to provide realistic simulated SARS-CoV-2 wastewater sequencing datasets with which other programs that rely on this type of data can be evaluated and improved. Our tool is suitable for simulating Illumina short-read RT-PCR amplified metagenomes. AVAILABILITY AND IMPLEMENTATION The code for this project is available at https://github.com/goldman-gp-ebi/SWAMPy. It can be installed on any Unix-based operating system and is available under the GPL-v3 license.
Affiliation(s)
- William Boulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambs CB10 1SD, United Kingdom
  - Department of Computing Sciences, University of East Anglia, Norwich, Norfolk NR4 7TJ, United Kingdom
- Fatma Rabia Fidan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambs CB10 1SD, United Kingdom
  - Department of Biological Sciences, Middle East Technical University, Ankara 06800, Turkey
  - Cancer Dynamics Laboratory, Francis Crick Institute, London NW1 1AT, United Kingdom
- Hubert Denise
  - Department of Health and Social Care, UK Health Security Agency, London SW1P 3HX, United Kingdom
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambs CB10 1SD, United Kingdom
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambs CB10 1SD, United Kingdom
2
Xu Y, Liu D, Han P, Wang H, Wang S, Gao J, Chen F, Zhou X, Deng K, Luo J, Zhou M, Kuang D, Yang F, Jiang Z, Xu S, Rao G, Wang Y, Qu J. Rapid inference of antibiotic resistance and susceptibility for Klebsiella pneumoniae by clinical shotgun metagenomic sequencing. Int J Antimicrob Agents 2024; 64:107252. [PMID: 38908534 DOI: 10.1016/j.ijantimicag.2024.107252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 05/14/2024] [Accepted: 06/07/2024] [Indexed: 06/24/2024]
Abstract
OBJECTIVES The study aimed to develop a genotypic antimicrobial resistance testing method for Klebsiella pneumoniae using metagenomic sequencing data. METHODS We utilized Lasso regression on assembled genomes to identify genetic resistance determinants for six antibiotics (Gentamicin, Tobramycin, Imipenem, Meropenem, Ceftazidime, Trimethoprim/Sulfamethoxazole). The genetic features were weighted, grouped into clusters to establish classifier models. Origin species of detected antibiotic resistant gene (ARG) was determined by novel strategy integrating "possible species," "gene copy number calculation" and "species-specific kmers." The performance of the method was evaluated on retrospective case studies. RESULTS Our study employed machine learning on 3928 K. pneumoniae isolates, yielding stable models with AUCs > 0.9 for various antibiotics. GenseqAMR, a read-based software, exhibited high accuracy (AUC 0.926-0.956) for short-read datasets. The integration of a species-specific kmer strategy significantly improved ARG-species attribution to an average accuracy of 96.67%. In a retrospective study of 191 K. pneumoniae-positive clinical specimens (0.68-93.39% genome coverage), GenseqAMR predicted 84.23% of AST results on average. It demonstrated 88.76-96.26% accuracy for resistance prediction, offering genotypic AST results with a shorter turnaround time (mean ± SD: 18.34 ± 0.87 hours) than traditional culture-based AST (60.15 ± 21.58 hours). Furthermore, a retrospective clinical case study involving 63 cases showed that GenseqAMR could lead to changes in clinical treatment for 24 (38.10%) cases, with 95.83% (23/24) of these changes deemed beneficial. CONCLUSIONS In conclusion, GenseqAMR is a promising tool for quick and accurate AMR prediction in Klebsiella pneumoniae, with the potential to improve patient outcomes through timely adjustments in antibiotic treatment.
Affiliation(s)
- Yanping Xu
  - Department of Pulmonary and Critical Care Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Institute of Respiratory Diseases, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Emergency Prevention, Diagnosis and Treatment of Respiratory Infectious Diseases, Shanghai, China
- Donglai Liu
  - National Institutes for Food and Drug Control, Beijing, China
- Peng Han
  - Genskey Medical Technology Co., Ltd, Beijing, China
- Hao Wang
  - National Institutes for Food and Drug Control, Beijing, China
- Shanmei Wang
  - Henan Provincial People's Hospital, People's Hospital of Zhengzhou University, Zhengzhou, Henan, China
- Jianpeng Gao
  - Genskey Medical Technology Co., Ltd, Beijing, China
- Xun Zhou
  - Institute of Antibiotics, Huashan Hospital, Fudan University, Shanghai, China; Key Laboratory of Clinical Pharmacology of Antibiotics, Ministry of Health, Shanghai, China
- Kun Deng
  - Department of Laboratory Medicine, The Third Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Jiajie Luo
  - Genskey Medical Technology Co., Ltd, Beijing, China
- Min Zhou
  - Department of Pulmonary and Critical Care Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Institute of Respiratory Diseases, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Emergency Prevention, Diagnosis and Treatment of Respiratory Infectious Diseases, Shanghai, China
- Dai Kuang
  - Department of Pulmonary and Critical Care Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Institute of Respiratory Diseases, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Emergency Prevention, Diagnosis and Treatment of Respiratory Infectious Diseases, Shanghai, China
- Fan Yang
  - Institute of Antibiotics, Huashan Hospital, Fudan University, Shanghai, China; Key Laboratory of Clinical Pharmacology of Antibiotics, Ministry of Health, Shanghai, China
- Zhi Jiang
  - Genskey Medical Technology Co., Ltd, Beijing, China
- Sihong Xu
  - National Institutes for Food and Drug Control, Beijing, China
- Guanhua Rao
  - Genskey Medical Technology Co., Ltd, Beijing, China
- Youchun Wang
  - National Institutes for Food and Drug Control, Beijing, China
- Jieming Qu
  - Department of Pulmonary and Critical Care Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Institute of Respiratory Diseases, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Emergency Prevention, Diagnosis and Treatment of Respiratory Infectious Diseases, Shanghai, China
3
Lai J, Yang Y, Liu Y, Scharpf RB, Karchin R. Assessing the merits: an opinion on the effectiveness of simulation techniques in tumor subclonal reconstruction. BIOINFORMATICS ADVANCES 2024; 4:vbae094. [PMID: 38948008 PMCID: PMC11213631 DOI: 10.1093/bioadv/vbae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 05/28/2024] [Accepted: 06/15/2024] [Indexed: 07/02/2024]
Abstract
Summary Neoplastic tumors originate from a single cell, and their evolution can be traced through lineages characterized by mutations, copy number alterations, and structural variants. These lineages are reconstructed and mapped onto evolutionary trees with algorithmic approaches. However, without ground truth benchmark sets, the validity of an algorithm remains uncertain, limiting potential clinical applicability. With a growing number of algorithms available, there is urgent need for standardized benchmark sets to evaluate their merits. Benchmark sets rely on in silico simulations of tumor sequence, but there are no accepted standards for simulation tools, presenting a major obstacle to progress in this field. Availability and implementation All analysis done in the paper was based on publicly available data from the publication of each accessed tool.
Affiliation(s)
- Jiaying Lai
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Yi Yang
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Yunzhou Liu
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
- Robert B Scharpf
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21231, United States
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, United States
- Rachel Karchin
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, United States
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21231, United States
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, United States
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, United States
4
Gamaarachchi H, Ferguson JM, Samarakoon H, Liyanage K, Deveson IW. Simulation of nanopore sequencing signal data with tunable parameters. Genome Res 2024; 34:778-783. [PMID: 38692839 PMCID: PMC11216307 DOI: 10.1101/gr.278730.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 04/24/2024] [Indexed: 05/03/2024]
Abstract
In silico simulation of high-throughput sequencing data is a technique used widely in the genomics field. However, there is currently a lack of effective tools for creating simulated data from nanopore sequencing devices, which measure DNA or RNA molecules in the form of time-series current signal data. Here, we introduce Squigulator, a fast and simple tool for simulation of realistic nanopore signal data. Squigulator takes a reference genome, a transcriptome, or read sequences, and generates corresponding raw nanopore signal data. This is compatible with basecalling software from Oxford Nanopore Technologies (ONT) and other third-party tools, thereby providing a useful substrate for development, testing, debugging, validation, and optimization at every stage of a nanopore analysis workflow. The user may generate data with preset parameters emulating specific ONT protocols or noise-free "ideal" data, or they may deterministically modify a range of experimental variables and/or noise parameters to shape the data to their needs. We present a brief example of Squigulator's use, creating simulated data to model the degree to which different parameters impact the accuracy of ONT basecalling and downstream variant detection. This analysis reveals new insights into the nature of ONT data and basecalling algorithms. We provide Squigulator as an open-source tool for the nanopore community.
Affiliation(s)
- Hasindu Gamaarachchi
  - School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
  - Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
  - Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- James M Ferguson
  - Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
  - Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- Hiruna Samarakoon
  - School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
  - Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
  - Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- Kisaru Liyanage
  - School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia
  - Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
  - Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
- Ira W Deveson
  - Genomics and Inherited Disease Program, Garvan Institute of Medical Research, Sydney, New South Wales 2010, Australia
  - Centre for Population Genomics, Garvan Institute of Medical Research and Murdoch Children's Research Institute, New South Wales 2010, Australia
  - St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Sydney, New South Wales 2052, Australia
5
Popitsch N, Neumann T, von Haeseler A, Ameres SL. Splice_sim: a nucleotide conversion-enabled RNA-seq simulation and evaluation framework. Genome Biol 2024; 25:166. [PMID: 38918865 DOI: 10.1186/s13059-024-03313-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Accepted: 06/17/2024] [Indexed: 06/27/2024]
Abstract
Nucleotide conversion RNA sequencing techniques interrogate chemical RNA modifications in cellular transcripts, resulting in mismatch-containing reads. Biases in mapping the resulting reads to reference genomes remain poorly understood. We present splice_sim, a splice-aware RNA-seq simulation and evaluation pipeline that introduces user-defined nucleotide conversions at set frequencies, creates mixture models of converted and unconverted reads, and calculates mapping accuracies per genomic annotation. By simulating nucleotide conversion RNA-seq datasets under realistic experimental conditions, including metabolic RNA labeling and RNA bisulfite sequencing, we measure mapping accuracies of state-of-the-art spliced-read mappers for mouse and human transcripts and derive strategies to prevent biases in the data interpretation.
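The idea of introducing conversions at a set frequency and mixing converted with unconverted reads can be sketched in a few lines. This is only an illustration under stated assumptions, not splice_sim's actual implementation or interface: the function names `convert_read` and `simulate_mixture`, the default T-to-C conversion, and all parameter values are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def convert_read(read: str, ref_base: str = "T", alt_base: str = "C",
                 rate: float = 0.05,
                 rng: Optional[random.Random] = None) -> str:
    """Replace each ref_base with alt_base independently with probability `rate`."""
    if rng is None:
        rng = random.Random()
    return "".join(
        alt_base if base == ref_base and rng.random() < rate else base
        for base in read
    )


def simulate_mixture(reads: List[str], labeled_fraction: float = 0.3,
                     rate: float = 0.05, seed: int = 0) -> List[Tuple[str, bool]]:
    """Return (sequence, is_labeled) pairs.

    A read is treated as coming from a labeled (convertible) molecule with
    probability `labeled_fraction`; only those reads receive conversions.
    """
    rng = random.Random(seed)
    mixture = []
    for read in reads:
        labeled = rng.random() < labeled_fraction
        sequence = convert_read(read, rate=rate, rng=rng) if labeled else read
        mixture.append((sequence, labeled))
    return mixture


if __name__ == "__main__":
    # Hypothetical toy reads; real input would come from simulated transcripts.
    for sequence, labeled in simulate_mixture(["ATTTGCTT", "GGTACTTA"], 0.5, 0.5):
        print(sequence, labeled)
```

Keeping the ground-truth `is_labeled` flag next to each read is what would let a downstream evaluation compare mapper output against the known origin of every read.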
Affiliation(s)
- Niko Popitsch
  - Max Perutz Labs, Vienna Biocenter Campus (VBC), Vienna, A-1030, Austria
  - Max Perutz Labs, Department of Biochemistry and Cell Biology, University of Vienna, Vienna, A-1030, Austria
- Tobias Neumann
  - Quantro Therapeutics, Vienna, A-1030, Austria
  - Vienna Biocenter PhD Program, a Doctoral School of the University of Vienna and Medical University of Vienna, Vienna, A-1030, Austria
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Vienna, A-1030, Austria
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna, Medical University of Vienna, Vienna, A-1030, Austria
  - Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Vienna, A-1090, Austria
- Stefan L Ameres
  - Max Perutz Labs, Vienna Biocenter Campus (VBC), Vienna, A-1030, Austria
  - Max Perutz Labs, Department of Biochemistry and Cell Biology, University of Vienna, Vienna, A-1030, Austria
  - Institute of Molecular Biotechnology, IMBA, Vienna Biocenter Campus (VBC), Vienna, A-1030, Austria
6
Yu M, Tang X, Li Z, Wang W, Wang S, Li M, Yu Q, Xie S, Zuo X, Chen C. High-throughput DNA synthesis for data storage. Chem Soc Rev 2024; 53:4463-4489. [PMID: 38498347 DOI: 10.1039/d3cs00469d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
With the explosion of the digital world, the dramatically increasing data volume is expected to reach 175 ZB (1 ZB = 10^12 GB) in 2025. Storing such huge global data would consume tons of resources. Fortunately, it has been found that the deoxyribonucleic acid (DNA) molecule is the most compact and durable information storage medium in the world so far. Its high coding density and long-term preservation properties make it one of the best data storage carriers for the future. High-throughput DNA synthesis is a key technology for "DNA data storage", which encodes a binary data stream (0/1) into long quaternary DNA sequences consisting of four bases (A/G/C/T). In this review, the workflow of DNA data storage and the basic methods of artificial DNA synthesis technology are outlined first. Then, the technical characteristics of different synthesis methods and the state of the art of representative commercial companies are presented, with a primary focus on silicon chip microarray-based synthesis and novel enzymatic DNA synthesis. Finally, the recent status of DNA storage and new opportunities for future development in the field of high-throughput, large-scale DNA synthesis technology are summarized.
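The step of encoding a binary stream into a four-base sequence can be illustrated with the simplest possible scheme, two bits per base. This is a minimal sketch and not a scheme taken from the review: the specific bit-to-base table and the function names are assumptions, and practical encodings typically add constraints (for example on homopolymer runs and GC content) and error correction that are omitted here.

```python
# Illustrative mapping only (an assumption): each pair of bits becomes one base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bytes_to_dna(data: bytes) -> str:
    """Encode bytes as a DNA string, two bits per base (four bases per byte)."""
    bitstream = "".join(f"{byte:08b}" for byte in data)
    return "".join(
        BITS_TO_BASE[bitstream[i:i + 2]] for i in range(0, len(bitstream), 2)
    )


def decode_dna_to_bytes(sequence: str) -> bytes:
    """Invert encode_bytes_to_dna; expects a length that is a multiple of four."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4 bases")
    bitstream = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bitstream[i:i + 8], 2) for i in range(0, len(bitstream), 8))


if __name__ == "__main__":
    payload = b"Hi"
    strand = encode_bytes_to_dna(payload)
    print(strand)  # CAGACGGC
    assert decode_dna_to_bytes(strand) == payload
```

At two bits per base, one byte occupies four bases, which is the theoretical upper bound on density for a four-letter alphabet before any redundancy is added.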
Affiliation(s)
- Meng Yu
  - Institute of Medical Chips, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
  - School of Microelectronics, Shanghai University, 201800, Shanghai, China
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
- Xiaohui Tang
  - Institute of Medical Chips, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
- Zhenhua Li
  - Institute of Medical Chips, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
- Weidong Wang
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
- Shaopeng Wang
  - Institute of Molecular Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 200127, Shanghai, China
- Min Li
  - Institute of Molecular Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 200127, Shanghai, China
- Qiuliyang Yu
  - Shenzhen Key Laboratory for the Intelligent Microbial Manufacturing of Medicines, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 518055, Shenzhen, China
- Sijia Xie
  - Institute of Medical Chips, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
  - School of Microelectronics, Shanghai University, 201800, Shanghai, China
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
- Xiaolei Zuo
  - Institute of Molecular Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 200127, Shanghai, China
- Chang Chen
  - Institute of Medical Chips, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
  - School of Microelectronics, Shanghai University, 201800, Shanghai, China
  - Shanghai Industrial μTechnology Research Institute, 201800, Shanghai, China
  - State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, 200050, Shanghai, China
7
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
8
Hui X, Yang J, Sun J, Liu F, Pan W. MCSS: microbial community simulator based on structure. Front Microbiol 2024; 15:1358257. [PMID: 38516019 PMCID: PMC10956353 DOI: 10.3389/fmicb.2024.1358257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 02/20/2024] [Indexed: 03/23/2024]
Abstract
De novo assembly plays a pivotal role in metagenomic analysis, and the incorporation of third-generation sequencing technology can significantly improve the integrity and accuracy of assembly results. Recently, with advancements in sequencing technology (Hi-Fi, ultra-long), several long-read-based bioinformatic tools have been developed. However, the validation of the performance and reliability of these tools is a crucial concern. To address this gap, we present MCSS (microbial community simulator based on structure), which has the capability to generate simulated microbial community and sequencing datasets based on the structure attributes of real microbiome communities. The evaluation results indicate that it can generate simulated communities that exhibit both diversity and similarity to actual community structures. Additionally, MCSS generates synthetic PacBio Hi-Fi and Oxford Nanopore Technologies (ONT) long reads for the species within the simulated community. This innovative tool provides a valuable resource for benchmarking and refining metagenomic analysis methods. Code available at: https://github.com/panlab-bio/mcss.
Affiliation(s)
- Xingqi Hui
  - Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
- Jinbao Yang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, China
- Jinhuan Sun
  - Key Laboratory of Plant Molecular Physiology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Botany, Chinese Academy of Sciences, Beijing, China
- Fang Liu
  - Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
  - National Key Laboratory of Cotton Bio-Breeding and Integrated Utilization, Institute of Cotton Research, Chinese Academy of Agricultural Sciences (ICR, CAAS), Anyang, China
- Weihua Pan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
9
Lai J, Liu Y, Scharpf RB, Karchin R. Evaluation of simulation methods for tumor subclonal reconstruction. ARXIV 2024:arXiv:2402.09599v1. [PMID: 38410652 PMCID: PMC10896360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2024]
Abstract
Most neoplastic tumors originate from a single cell, and their evolution can be genetically traced through lineages characterized by common alterations such as small somatic mutations (SSMs), copy number alterations (CNAs), structural variants (SVs), and aneuploidies. Due to the complexity of these alterations in most tumors and the errors introduced by sequencing protocols and calling algorithms, tumor subclonal reconstruction algorithms are necessary to recapitulate the DNA sequence composition and tumor evolution in silico. With a growing number of these algorithms available, there is a pressing need for consistent and comprehensive benchmarking, which relies on realistic tumor sequencing generated by simulation tools. Here, we examine the current simulation methods, identifying their strengths and weaknesses, and provide recommendations for their improvement. Our review also explores potential new directions for research in this area. This work aims to serve as a resource for understanding and enhancing tumor genomic simulations, contributing to the advancement of the field.
Affiliation(s)
- Jiaying Lai
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
- Yunzhou Liu
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
- Robert B. Scharpf
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD
- Rachel Karchin
  - Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD
  - Department of Oncology, Johns Hopkins Medical Institutions, Baltimore, MD
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD
10
Joeres M, Maksimov P, Höper D, Calvelage S, Calero-Bernal R, Fernández-Escobar M, Koudela B, Blaga R, Vrhovec MG, Stollberg K, Bier N, Sotiraki S, Sroka J, Piotrowska W, Kodym P, Basso W, Conraths FJ, Mercier A, Galal L, Dardé ML, Balea A, Spano F, Schulze C, Peters M, Scuda N, Lundén A, Davidson RK, Terland R, Waap H, de Bruin E, Vatta P, Caccio S, Ortega-Mora LM, Jokelainen P, Schares G. Genotyping of European Toxoplasma gondii strains by a new high-resolution next-generation sequencing-based method. Eur J Clin Microbiol Infect Dis 2024; 43:355-371. [PMID: 38099986 PMCID: PMC10822014 DOI: 10.1007/s10096-023-04721-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Accepted: 11/16/2023] [Indexed: 01/28/2024]
Abstract
PURPOSE A new high-resolution next-generation sequencing (NGS)-based method was established to type closely related European type II Toxoplasma gondii strains. METHODS T. gondii field isolates were collected from different parts of Europe and assessed by whole genome sequencing (WGS). In comparison to ME49 (a type II reference strain), highly polymorphic regions (HPRs) were identified, showing a considerable number of single nucleotide polymorphisms (SNPs). After confirmation by Sanger sequencing, 18 HPRs were used to design a primer panel for multiplex PCR to establish a multilocus Ion AmpliSeq typing method. Toxoplasma gondii isolates and T. gondii present in clinical samples were typed with the new method. The sensitivity of the method was tested with serially diluted reference DNA samples. RESULTS Among type II specimens, the method could differentiate the same number of haplotypes as the reference standard, microsatellite (MS) typing. Passages of the same isolates and specimens originating from abortion outbreaks were identified as identical. In addition, seven different genotypes, two atypical and two recombinant specimens were clearly distinguished from each other by the method. Furthermore, almost all SNPs detected by the Ion AmpliSeq method corresponded to those expected based on WGS. By testing serially diluted DNA samples, the method exhibited a similar analytical sensitivity as MS typing. CONCLUSION The new method can distinguish different T. gondii genotypes and detect intra-genotype variability among European type II T. gondii strains. Furthermore, with WGS data additional target regions can be added to the method to potentially increase typing resolution.
Affiliation(s)
- M Joeres
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Epidemiology, Greifswald - Insel Riems, Germany
- P Maksimov
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Epidemiology, Greifswald - Insel Riems, Germany
- D Höper
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Diagnostic Virology, Greifswald - Insel Riems, Germany
- S Calvelage
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Diagnostic Virology, Greifswald - Insel Riems, Germany
- R Calero-Bernal
- SALUVET, Animal Health Department, Faculty of Veterinary Sciences, Complutense University of Madrid, Madrid, Spain
- M Fernández-Escobar
- SALUVET, Animal Health Department, Faculty of Veterinary Sciences, Complutense University of Madrid, Madrid, Spain
- B Koudela
- Central European Institute of Technology (CEITEC), University of Veterinary Sciences Brno, Brno, Czech Republic
- Faculty of Veterinary Medicine, University of Veterinary Sciences Brno, Brno, Czech Republic
- R Blaga
- Anses, INRAE, Ecole Nationale Vétérinaire d'Alfort, Laboratoire de Santé Animale, BIPAR, Maisons-Alfort, France
- University of Agricultural Sciences and Veterinary Medicine, Cluj-Napoca, Romania
- K Stollberg
- German Federal Institute for Risk Assessment, Department for Biological Safety, Berlin, Germany
- N Bier
- German Federal Institute for Risk Assessment, Department for Biological Safety, Berlin, Germany
- S Sotiraki
- Veterinary Research Institute, Hellenic Agricultural Organisation-DIMITRA, Thessaloniki, Greece
- J Sroka
- Department of Parasitology and Invasive Diseases, National Veterinary Research Institute, Pulawy, Poland
- W Piotrowska
- Department of Parasitology and Invasive Diseases, National Veterinary Research Institute, Pulawy, Poland
- P Kodym
- Centre of Epidemiology and Microbiology, National Institute of Public Health, Prague, Czech Republic
- W Basso
- Institute of Parasitology, Vetsuisse Faculty, University of Bern, Bern, Switzerland
- F J Conraths
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Epidemiology, Greifswald - Insel Riems, Germany
- A Mercier
- Inserm U1094, IRD U270, Univ. Limoges, CHU Limoges, EpiMaCT - Epidemiology of chronic diseases in tropical zone, Institute of Epidemiology and Tropical Neurology, OmegaHealth, Limoges, France
- Centre National de Référence (CNR) Toxoplasmose Centre Hospitalier-Universitaire Dupuytren, Limoges, France
- L Galal
- Inserm U1094, IRD U270, Univ. Limoges, CHU Limoges, EpiMaCT - Epidemiology of chronic diseases in tropical zone, Institute of Epidemiology and Tropical Neurology, OmegaHealth, Limoges, France
- M L Dardé
- Inserm U1094, IRD U270, Univ. Limoges, CHU Limoges, EpiMaCT - Epidemiology of chronic diseases in tropical zone, Institute of Epidemiology and Tropical Neurology, OmegaHealth, Limoges, France
- Centre National de Référence (CNR) Toxoplasmose Centre Hospitalier-Universitaire Dupuytren, Limoges, France
- A Balea
- University of Agricultural Sciences and Veterinary Medicine Cluj-Napoca, Faculty of Veterinary Medicine, Department of Parasitology and Parasitic Diseases, Cluj-Napoca, Romania
- F Spano
- Italian National Institute of Health, Rome, Italy
- C Schulze
- Landeslabor Berlin-Brandenburg, Frankfurt (Oder), Germany
- M Peters
- Chemisches und Veterinäruntersuchungsamt Westfalen, Standort Arnsberg, Arnsberg, Germany
- N Scuda
- Bavarian Health and Food Safety Authority, Erlangen, Germany
- A Lundén
- Department of Microbiology, National Veterinary Institute, Uppsala, Sweden
- R K Davidson
- Department of Animal Health, Welfare and Food Safety, Norwegian Veterinary Institute, Tromsø, Norway
- R Terland
- Department of Analysis and Diagnostics, Norwegian Veterinary Institute, Ås, Norway
- H Waap
- Parasitology Laboratory, Instituto Nacional de Investigação Agrária e Veterinária, Oeiras, Portugal
- E de Bruin
- Dutch Wildlife Health Centre, Pathology Division, Department of Pathobiology, Faculty of Veterinary Medicine, University of Utrecht, Utrecht, The Netherlands
- P Vatta
- Italian National Institute of Health, Rome, Italy
- S Caccio
- Italian National Institute of Health, Rome, Italy
- L M Ortega-Mora
- SALUVET, Animal Health Department, Faculty of Veterinary Sciences, Complutense University of Madrid, Madrid, Spain
- P Jokelainen
- Infectious Disease Preparedness, Statens Serum Institut, Copenhagen, Denmark
- G Schares
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Epidemiology, Greifswald - Insel Riems, Germany.
|
11
|
Mestre-Tomás J, Liu T, Pardo-Palacios F, Conesa A. SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark. Genome Biol 2023; 24:286. [PMID: 38082294 PMCID: PMC10712166 DOI: 10.1186/s13059-023-03127-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Accepted: 11/27/2023] [Indexed: 12/18/2023]
Abstract
Long-read RNA sequencing has emerged as a powerful tool for transcript discovery, even in well-annotated organisms. However, assessing the accuracy of different methods in identifying annotated and novel transcripts remains a challenge. Here, we present SQANTI-SIM, a versatile tool that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3. By selectively excluding specific transcripts from the reference dataset, SQANTI-SIM effectively emulates scenarios involving unannotated transcripts. Furthermore, the tool provides customizable features and supports the simulation of additional types of data, representing the first multi-omics simulation tool for the lrRNA-seq field.
Affiliation(s)
- Jorge Mestre-Tomás
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedrátic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Camino de Vera, Valencia, 46022, Spain
- Tianyuan Liu
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedrátic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Francisco Pardo-Palacios
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedrátic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Ana Conesa
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedrátic Agustín Escardino Benlloch, Paterna, 46980, Spain.
|
12
|
Mwima R, Hui TYJ, Nanteza A, Burt A, Kayondo JK. Potential persistence mechanisms of the major Anopheles gambiae species complex malaria vectors in sub-Saharan Africa: a narrative review. Malar J 2023; 22:336. [PMID: 37936194 PMCID: PMC10631165 DOI: 10.1186/s12936-023-04775-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Accepted: 10/30/2023] [Indexed: 11/09/2023]
Abstract
The source of malaria vector populations that re-establish at the beginning of the rainy season is still unclear, yet knowledge of mosquito behaviour is required to institute control measures effectively. Alternative hypotheses such as aestivation, local refugia, migration between neighbouring sites, and long-distance migration (LDM) have been proposed to explain mosquito persistence. This work assessed malaria vector persistence dynamics and examined studies of vector survival under these hypotheses (aestivation, local refugia, and local or long-distance migration) across sub-Saharan Africa; it explored the range of methods used and the ecological parameters considered, and highlighted knowledge trends and gaps. The results on which persistence mechanism supports the re-establishment of Anopheles gambiae, Anopheles coluzzii or Anopheles arabiensis in sub-Saharan Africa were not conclusive, given that each method used had its limitations. For example, the Mark-Release-Recapture (MRR) method suffers from a low recapture rate that affects its accuracy, and time series analysis of field collections leaves it uncertain whether the failure to find mosquitoes during the dry season reflects a weakness of the conventional sampling methods used or the existence of hidden shelters. This calls for further investigations emphasizing ecological experiments under controlled laboratory or semi-field conditions, together with genetic approaches, as the two are known to complement each other. This review therefore unveils and assesses the uncertainties surrounding the different malaria vector persistence mechanisms and provides recommendations for future studies.
Affiliation(s)
- Rita Mwima
- Department of Entomology, Uganda Virus Research Institute (UVRI), Entebbe, Uganda
- Department of Biotechnical and Diagnostic Sciences, College of Veterinary Medicine, Animal Resources and Biosecurity (COVAB), Makerere University, Kampala, Uganda
- Tin-Yu J Hui
- Silwood Park Campus, Department of Life Sciences, Imperial College London, Ascot, UK
- Ann Nanteza
- Department of Biotechnical and Diagnostic Sciences, College of Veterinary Medicine, Animal Resources and Biosecurity (COVAB), Makerere University, Kampala, Uganda
- Austin Burt
- Silwood Park Campus, Department of Life Sciences, Imperial College London, Ascot, UK
- Jonathan K Kayondo
- Department of Entomology, Uganda Virus Research Institute (UVRI), Entebbe, Uganda.
|
13
|
Mesloub Y, Beury D, Vandermeeren F, Caboche S. CuReSim-LoRM: A Tool to Simulate Metabarcoding Long Reads. Int J Mol Sci 2023; 24:14005. [PMID: 37762307 PMCID: PMC10531135 DOI: 10.3390/ijms241814005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 09/07/2023] [Accepted: 09/10/2023] [Indexed: 09/29/2023]
Abstract
Metabarcoding DNA sequencing has revolutionized the study of microbial communities. Third-generation sequencing, which produces long reads, has opened up new perspectives. Obtaining the full-length ribosomal RNA gene would permit one to reach a better taxonomic resolution at the species or the strain level. However, Oxford Nanopore Technologies (ONT) sequencing produces reads with high error rates, which introduces biases in analysis. Understanding the biases introduced during the analysis allows one to better interpret the biological results and to be cautious about conclusions drawn from metabarcoding experiments. To benchmark an analysis process, the ground truth, i.e., the real composition of the microbial community, has to be known. In addition to artificial mock communities, simulated data are often used to evaluate the biases and performance of the bioinformatics analysis step. Currently, no specific tool has been developed to simulate metabarcoding long reads, mimic the error rate and the length distribution, and allow one to benchmark the analysis process. Here, we introduce CuReSim-LoRM, a customized read simulator to generate long reads for metabarcoding. We showed that CuReSim-LoRM is able to produce reads with varying error rates and length distributions by mimicking the real data very well.
Affiliation(s)
- Ségolène Caboche
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, US 41-UAR 2014-PLBS, F-59000 Lille, France
|
14
|
Mestre-Tomás J, Liu T, Pardo-Palacios F, Conesa A. SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark. bioRxiv 2023:2023.08.23.554392. [PMID: 37662216 PMCID: PMC10473693 DOI: 10.1101/2023.08.23.554392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Long-read RNA-seq has emerged as a powerful tool for transcript discovery, even in well-annotated organisms. However, assessing the accuracy of different methods in identifying annotated and novel transcripts remains a challenge. Here, we present SQANTI-SIM, a versatile utility that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3. By selectively excluding specific transcripts from the reference dataset, SQANTI-SIM effectively emulates scenarios involving unannotated transcripts. Furthermore, the tool provides customizable features and supports the simulation of additional types of data, representing the first multi-omics simulation tool for the lrRNA-seq field. We demonstrate the effectiveness of SQANTI-SIM by benchmarking five transcriptome reconstruction pipelines using the simulated data.
Affiliation(s)
- Jorge Mestre-Tomás
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Tianyuan Liu
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Francisco Pardo-Palacios
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
- Ana Conesa
- Institute for Integrative Systems Biology, Spanish National Research Council, Catedràtic Agustín Escardino Benlloch, Paterna, 46980, Spain
|
15
|
Abstract
Following the widespread use of deep learning for genomics, deep generative modeling is also becoming a viable methodology for the broad field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the real characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction by mapping the data space to a latent space, as well as for prediction tasks via exploitation of this learned mapping or supervised/semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, we present conceptual applications along with notable examples in functional and evolutionary genomics, and we provide our perspective on potential challenges and future directions.
Affiliation(s)
- Burak Yelmen
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France
- Institute of Genomics, University of Tartu, Tartu, Estonia
- Flora Jay
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France
|
16
|
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023]
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
- Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
- Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
|
17
|
Performance evaluation of six popular short-read simulators. Heredity (Edinb) 2023; 130:55-63. [PMID: 36496447 PMCID: PMC9905089 DOI: 10.1038/s41437-022-00577-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/06/2022] [Revised: 11/10/2022] [Accepted: 11/11/2022] [Indexed: 12/14/2022]
Abstract
High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas "gold-standard" empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design; yet the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators (ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim) and discuss important considerations for selecting suitable models for benchmarking.
|
18
|
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2023; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023]
Abstract
BACKGROUND Low-complexity data analysis addresses the search for and quantification of regions in sequences that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth of coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances, namely local, medium, or distant associations. FINDINGS This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to provide, for the first time, a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS The AlcoR method provides fast sequence characterization through data complexity analysis, ideal for scenarios involving new or unknown sequences. AlcoR is implemented in C using multithreading to increase the computational speed, is flexible for multiple applications, and has no external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
Affiliation(s)
- Jorge M Silva
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Weihong Qi
- Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
- SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
- Armando J Pinho
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Diogo Pratas
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
|
19
|
Shang J, Cai X, Zhang T, Sun Y, Zhang Y, Liu J, Guan B. EpiReSIM: A Resampling Method of Epistatic Model without Marginal Effects Using Under-Determined System of Equations. Genes (Basel) 2022; 13:2286. [PMID: 36553553 PMCID: PMC9777644 DOI: 10.3390/genes13122286] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/18/2022] [Revised: 11/30/2022] [Accepted: 12/01/2022] [Indexed: 12/12/2022]
Abstract
Simulation experiments are essential to evaluate epistasis detection methods, which is the main way to prove their effectiveness and move toward practical applications. However, due to the lack of effective simulators, especially for simulating models without marginal effects (eNME models), epistasis detection methods can hardly verify their effectiveness through simulation experiments. In this study, we propose a resampling simulation method (EpiReSIM) for generating the eNME model. First, EpiReSIM provides two strategies for solving eNME models. One is to calculate eNME models using prevalence constraints, and another is by joint constraints of prevalence and heritability. We transform the computation of the model into the problem of solving the under-determined system of equations. Introducing the complete orthogonal decomposition method and Newton's method, EpiReSIM calculates the solution of the underdetermined system of equations to obtain the eNME model, especially the solution of the high-order model, which is the highlight of EpiReSIM. Second, based on the computed eNME model, EpiReSIM generates simulation data by a resampling method. Experimental results show that EpiReSIM has advantages in preserving the biological properties of minor allele frequencies and calculating high-order models, and it is a convenient and effective alternative method for current simulation software.
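To illustrate why the eNME problem is under-determined, the following is a minimal sketch, not EpiReSIM's solver: it builds one two-locus penetrance table by hand, using the prevalence constraint only, and checks that every single-locus marginal equals the prevalence. The function names, the Hardy-Weinberg assumption, and all numeric values are assumptions chosen for illustration.

```python
def hwe_genotype_freqs(maf):
    """Hardy-Weinberg frequencies for genotypes carrying 0, 1, 2 minor alleles."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def zero_mean_vector(freqs):
    """Return u such that sum(freqs[i] * u[i]) == 0 (frequency-weighted zero mean)."""
    raw = [0.0, 1.0, 2.0]  # arbitrary per-genotype scores (illustrative choice)
    mean = sum(f * r for f, r in zip(freqs, raw))
    return [r - mean for r in raw]


def enme_penetrance(maf_a, maf_b, prevalence, strength):
    """Build a 3x3 penetrance table f[i][j] = K + c * u[i] * v[j].

    Because u and v have weighted mean zero, the interaction term vanishes
    in every marginal, so each single locus shows no effect on its own.
    Varying `strength` (or u, v) gives other valid tables: the constraints
    leave the system under-determined.
    """
    p = hwe_genotype_freqs(maf_a)
    q = hwe_genotype_freqs(maf_b)
    u = zero_mean_vector(p)
    v = zero_mean_vector(q)
    table = [[prevalence + strength * u[i] * v[j] for j in range(3)]
             for i in range(3)]
    return p, q, table


p, q, table = enme_penetrance(maf_a=0.3, maf_b=0.2, prevalence=0.1, strength=0.05)

# Marginal penetrance of each genotype at locus A (averaging over locus B), and vice versa.
row_marginals = [sum(q[j] * table[i][j] for j in range(3)) for i in range(3)]
col_marginals = [sum(p[i] * table[i][j] for i in range(3)) for j in range(3)]
```

With these inputs every entry of `table` stays between 0 and 1, the entries differ across genotype combinations, and all six marginals equal the prevalence up to floating-point error. A real solver would additionally handle the heritability constraint and higher-order models.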
Affiliation(s)
- Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Xinrui Cai
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Tongdui Zhang
- Science and Technology Innovation Service Institution of Rizhao, Rizhao 276827, China
- Yan Sun
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Yuanyuan Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China
- Jinxing Liu
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Boxin Guan
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
|
20
|
Ono Y, Hamada M, Asai K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom Bioinform 2022; 4:lqac092. [PMID: 36465498 PMCID: PMC9713900 DOI: 10.1093/nargab/lqac092] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/10/2022] [Revised: 11/02/2022] [Accepted: 11/12/2022] [Indexed: 12/03/2022]
Abstract
Long-read sequencers, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencers, have improved their read length and accuracy, thereby opening up unprecedented research. Many tools and algorithms have been developed to analyze long reads, and rapid progress in PacBio and ONT has further accelerated their development. Together with the development of high-throughput sequencing technologies and their analysis tools, many read simulators have been developed and effectively utilized. PBSIM is one of the popular long-read simulators. In this study, we developed PBSIM3 with three new functions: error models for long reads, multi-pass sequencing for high-fidelity read simulation and transcriptome sequencing simulation. Therefore, PBSIM3 is now able to meet a wide range of long-read simulation requirements.
Affiliation(s)
- Yukiteru Ono
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan
- Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 63-520, 3-4-1, Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
- Institute for Medical-Oriented Structural Biology, Waseda University, 2-2, Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
- Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan
- Kiyoshi Asai
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, 135-0064 Tokyo, Japan
|
21
|
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
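As a rough illustration of the de Bruijn graph (DBG) strategy named above, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the distinct k-mers found in a set of reads, then follows unambiguous edges to spell out a contig. It is a toy under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the example reads are hypothetical and not taken from the reviewed work.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # each distinct k-mer contributes one edge
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop on a cycle rather than looping forever
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping error-free reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
```

Here the walk recovers the full 10-base sequence. Branching nodes, which arise from repeats or sequencing errors, are exactly where real assemblers need extra logic and where misassemblies can be introduced.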
|
22
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023]
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Affiliation(s)
- Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
23
Angaroni F, Guidi A, Ascolani G, d'Onofrio A, Antoniotti M, Graudenzi A. J-SPACE: a Julia package for the simulation of spatial models of cancer evolution and of sequencing experiments. BMC Bioinformatics 2022; 23:269. [PMID: 35804300 PMCID: PMC9270769 DOI: 10.1186/s12859-022-04779-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Accepted: 06/09/2022] [Indexed: 11/15/2022] Open
Abstract
Background The combined effects of biological variability and measurement-related errors on cancer sequencing data remain largely unexplored. However, the spatio-temporal simulation of multi-cellular systems provides a powerful instrument to address this issue. In particular, efficient algorithmic frameworks are needed to overcome the harsh trade-off between scalability and expressivity, so as to allow one to simulate both realistic cancer evolution scenarios and the related sequencing experiments, which can then be used to benchmark downstream bioinformatics methods. Result We introduce a Julia package for SPAtial Cancer Evolution (J-SPACE), which allows one to model and simulate a broad set of experimental scenarios, phenomenological rules and sequencing settings. Specifically, J-SPACE simulates the spatial dynamics of cells as a continuous-time multi-type birth-death stochastic process on an arbitrary graph, employing different rules of interaction and an optimised Gillespie algorithm. The evolutionary dynamics of genomic alterations (single-nucleotide variants and indels) is simulated either under the Infinite Sites Assumption or several different substitution models, including one based on mutational signatures. After mimicking the spatial sampling of tumour cells, J-SPACE returns the related phylogenetic model, and allows one to generate synthetic reads from several Next-Generation Sequencing (NGS) platforms, via the ART read simulator. The results are finally returned in standard FASTA, FASTQ, SAM, ALN and Newick file formats. Conclusion J-SPACE is designed to efficiently simulate the heterogeneous behaviour of a large number of cancer cells and produces a rich set of outputs. Our framework is useful to investigate the emergent spatial dynamics of cancer subpopulations, as well as to assess the impact of incomplete sampling and of experiment-specific errors. 
Importantly, the output of J-SPACE is designed to allow the performance assessment of downstream bioinformatics pipelines processing NGS data. J-SPACE is freely available at: https://github.com/BIMIB-DISCo/J-Space.jl.
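To make the "continuous-time birth-death process simulated with a Gillespie algorithm" idea concrete, here is a minimal Python sketch. It is not J-SPACE's implementation (which is written in Julia, is multi-type, runs on a graph, and is optimised); it assumes a single well-mixed cell population with constant per-cell rates, and the function name `gillespie_birth_death` and all parameters are hypothetical.

```python
import random

def gillespie_birth_death(n0, birth_rate, death_rate, t_max, seed=0):
    """Simulate a single-type linear birth-death process (illustrative only).

    Assumptions: well-mixed population, constant per-cell rates, no space.
    Returns a list of (time, population_size) pairs, one per event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        total_rate = n * (birth_rate + death_rate)
        if total_rate == 0:
            break  # nothing can happen any more
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose the event type in proportion to its rate.
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory
```

Calling `gillespie_birth_death(10, 1.0, 0.5, 2.0)` returns the event times and population sizes of one stochastic realisation; a spatial version would additionally track which graph node each cell occupies.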
Affiliation(s)
- Fabrizio Angaroni
- Dept. of Informatics, Systems and Communication, Univ. of Milan-Bicocca, Milan, Italy
- Alessandro Guidi
- Dept. of Informatics, Systems and Communication, Univ. of Milan-Bicocca, Milan, Italy
- Gianluca Ascolani
- Dept. of Informatics, Systems and Communication, Univ. of Milan-Bicocca, Milan, Italy
- Alberto d'Onofrio
- Department of Mathematics and Geosciences, Univ. of Trieste, Trieste, Italy
- Marco Antoniotti
- Dept. of Informatics, Systems and Communication, Univ. of Milan-Bicocca, Milan, Italy; Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4), Milan, Italy
- Alex Graudenzi
- Dept. of Informatics, Systems and Communication, Univ. of Milan-Bicocca, Milan, Italy; Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4), Milan, Italy; Inst. of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Segrate, Italy
24
van Waaij J, Li Z, Wiuf C. Estimation of the covariance structure from SNP allele frequencies. Stat Appl Genet Mol Biol 2022; 21:sagmb-2022-0005. [DOI: 10.1515/sagmb-2022-0005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 05/02/2022] [Indexed: 11/15/2022]
Abstract
We propose two new statistics, $\hat{V}$ and $\hat{S}$, to disentangle the population history of related populations from SNP frequency data. If the populations are related by a tree, we show by theoretical means as well as by simulation that the new statistics are able to identify the root of a tree correctly, in contrast to standard statistics, such as the observed matrix of $F_2$-statistics (distances between pairs of populations). The statistic $\hat{V}$ is obtained by averaging over all SNPs (similar to standard statistics). Its expectation is the true covariance matrix of the observed population SNP frequencies, offset by a matrix with identical entries. In contrast, the statistic $\hat{S}$ is put in a Bayesian context and is obtained by averaging over pairs of SNPs, such that each SNP is only used once. It thus makes use of the joint distribution of pairs of SNPs. In addition, we provide a number of novel mathematical results about old and new statistics, and their mutual relationship.
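As a point of comparison for the standard statistic named in the abstract, the sketch below computes a naive plug-in matrix of F2-statistics, taking F2(A, B) as the average over SNPs of the squared difference in allele frequency between populations A and B. This illustrates only the standard distance, not the paper's new statistics (whose definitions are not given here). It assumes frequencies are known exactly, so no sampling-bias correction is applied, and the function name `f2_matrix` is hypothetical.

```python
def f2_matrix(freqs):
    """Naive plug-in F2 matrix (illustrative only).

    freqs[p][s] is the allele frequency of SNP s in population p.
    Entry [a][b] is the mean over SNPs of (freqs[a][s] - freqs[b][s]) ** 2.
    """
    n_pop = len(freqs)
    n_snp = len(freqs[0])
    return [
        [
            sum((freqs[a][s] - freqs[b][s]) ** 2 for s in range(n_snp)) / n_snp
            for b in range(n_pop)
        ]
        for a in range(n_pop)
    ]
```

For example, `f2_matrix([[0.1, 0.5], [0.3, 0.5], [0.1, 0.9]])` gives a symmetric 3 x 3 matrix with a zero diagonal, one row per population.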
Affiliation(s)
- Jan van Waaij
- Department of Mathematical Science, University of Copenhagen, Copenhagen 2100, Denmark
- Zilong Li
- Department of Biology, University of Copenhagen, Copenhagen 2100, Denmark
- Carsten Wiuf
- Department of Mathematical Science, University of Copenhagen, Copenhagen 2100, Denmark
25
Feng X, Chen L. SCSilicon: a tool for synthetic single-cell DNA sequencing data generation. BMC Genomics 2022; 23:359. [PMID: 35546390 PMCID: PMC9092674 DOI: 10.1186/s12864-022-08566-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/25/2022] Open
Abstract
Background Single-cell DNA sequencing is becoming indispensable in the study of cell-specific cancer genomics. The performance of computational tools that tackle single-cell genome aberrations may nevertheless be undervalued or overvalued, owing to the insufficient size of benchmarking data. In silico simulation is a cost-effective approach to generate as many single-cell genomes as possible in a controlled manner to enable reliable and valid benchmarking. Results This study proposes a new tool, SCSilicon, which efficiently generates single-cell in silico DNA reads with minimum manual intervention. SCSilicon automatically creates a set of genomic aberrations, including SNP, SNV, Indel, and CNV. In addition, SCSilicon yields the ground truth of CNV segmentation breakpoints and subclone cell labels. We have manually inspected a series of synthetic variations. We conducted a sanity check of the state-of-the-art single-cell CNV callers and found SCYN was the most robust one. Conclusions SCSilicon is a user-friendly software package for users to develop and benchmark single-cell CNV callers. Source code of SCSilicon is available at https://github.com/xikanfeng2/SCSilicon. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-022-08566-w).
Affiliation(s)
- Xikang Feng
- School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China
- Lingxi Chen
- Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
26
Pfeifer JD, Loberg R, Lofton-Day C, Zehnbauer BA. Reference Samples to Compare Next-Generation Sequencing Test Performance for Oncology Therapeutics and Diagnostics. Am J Clin Pathol 2022; 157:628-638. [PMID: 34871357 DOI: 10.1093/ajcp/aqab164] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/22/2021] [Accepted: 08/24/2021] [Indexed: 11/15/2022] Open
Abstract
OBJECTIVES Diversity of laboratory-developed tests (LDTs) using next-generation sequencing (NGS) raises concerns about their accuracy for selection of targeted therapies. A working group developed a pilot study of traceable reference samples to measure NGS LDT performance among a cohort of clinical laboratories. METHODS Human cell lines were engineered via CRISPR/Cas9 and prepared as formalin-fixed, paraffin-embedded cell pellets ("wet" samples) to assess the entire NGS test cycle. In silico mutagenized NGS sequence files ("dry" samples) were used to assess the bioinformatics component of the NGS test cycle. Single and multinucleotide variants (n = 36) of KRAS and NRAS were tested at 5% or 15% variant allele fraction to determine eligibility for therapy with the EGFR inhibitor panitumumab in the setting of metastatic colorectal cancer. RESULTS Twenty-one (21/21) laboratories tested wet samples; 19 of 21 analyzed dry samples. Of the laboratories that tested both the wet and dry samples, 7 (37%) of 19 laboratories correctly reported all variants, 3 (16%) of 19 had fewer than five errors, and 9 (47%) of 19 had five or more errors. Most errors were false negatives. CONCLUSIONS Genetically engineered cell lines and mutagenized sequence files are complementary reference samples for evaluating NGS test performance among clinical laboratories using LDTs. Variable accuracy in detection of genetic variants among some LDTs may identify different patient populations for targeted therapy.
Affiliation(s)
- John D Pfeifer
- Department of Pathology, Washington University School of Medicine, St Louis, MO, USA
- Robert Loberg
- Clinical Biomarkers and Diagnostics, Thousand Oaks, CA, USA
- Barbara A Zehnbauer
- Department of Pathology, Emory University School of Medicine, Atlanta, GA, USA
27
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2022; 10:80. [PMID: 35847383 PMCID: PMC9243550 DOI: 10.12688/f1000research.39214.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/10/2022] [Indexed: 11/20/2022] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
Affiliation(s)
- Marco Fabbri
- European Commission Joint Research Centre, Ispra, Italy
- Guy Van den Eede
- European Commission Joint Research Centre, Ispra, Italy
- European Commission Joint Research Centre, Geel, Belgium
- Erik Alm
- The European Centre for Disease Prevention and Control, Stockholm, Sweden
- Derya Aytan-Aktug
- National Food Institute, Technical University of Denmark, Lyngby, Denmark
- Catherine Carrillo
- Ottawa Laboratory – Carling, Canadian Food Inspection Agency, Ottawa, Ontario, Canada
- Kok-Gan Chan
- International Genome Centre, Jiangsu University, Zhenjiang, China
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Teresa Coque
- Servicio de Microbiología, Hospital Universitario Ramón y Cajal, Instituto Ramón y Cajal de Investigación Sanitaria (IRYCIS), Madrid, Spain
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Carlos III Health Institute, Madrid, Spain
- Ivo Gut
- Centro Nacional de Análisis Genómico, Centre for Genomic Regulation (CNAG-CRG), Barcelona Institute of Technology, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
- Paul Hammer
- BIOMES. NGS GmbH c/o Technische Hochschule Wildau, Wildau, Germany
- Gemma L. Kay
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- Jean-Yves Madec
- Unité Antibiorésistance et Virulence Bactériennes, ANSES Site de Lyon, Lyon, France
- Alison E. Mather
- Quadram Institute Bioscience, Norwich Research Park, Norwich, UK
- University of East Anglia, Norwich, UK
- Thierry Naas
- French-NRC for CPEs, Service de Bactériologie-Hygiène, Hôpital de Bicêtre, Le Kremlin-Bicêtre, France
- Silke Peter
- Institute of Medical Microbiology and Hygiene, University of Tübingen, Tübingen, Germany
- Arthur Pightling
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, USA
- John Rossen
- Department of Medical Microbiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Robert Schlaberg
- Department of Pathology, University of Utah, Salt Lake City, UT, USA
- Kevin Vanneste
- Transversal activities in Applied Genomics, Sciensano, Brussels, Belgium
- Lukas M. Weber
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Present address: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
28
Liu Z, Roberts R, Mercer TR, Xu J, Sedlazeck FJ, Tong W. Towards accurate and reliable resolution of structural variants for clinical diagnosis. Genome Biol 2022; 23:68. [PMID: 35241127 PMCID: PMC8892125 DOI: 10.1186/s13059-022-02636-8] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 08/27/2021] [Accepted: 02/15/2022] [Indexed: 12/17/2022] Open
Abstract
Structural variants (SVs) are a major source of human genetic diversity and have been associated with different diseases and phenotypes. The detection of SVs is difficult, and a diverse range of detection methods and data analysis protocols has been developed. This difficulty and diversity make the detection of SVs for clinical applications challenging and require a framework to ensure accuracy and reproducibility. Here, we discuss current developments in the diagnosis of SVs and propose a roadmap for the accurate and reproducible detection of SVs that includes case studies provided from the FDA-led SEquencing Quality Control Phase II (SEQC-II) and other consortium efforts.
Affiliation(s)
- Zhichao Liu
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Ruth Roberts
- ApconiX, BioHub at Alderley Park, Alderley Edge, SK10 4TG, UK
- University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Timothy R Mercer
- Australian Institute for Bioengineering and Nanotechnology, University of Queensland, Brisbane, QLD, Australia
- Garvan Institute of Medical Research, Sydney, NSW, Australia
- St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Joshua Xu
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
- Weida Tong
- National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
29
Wan Y, Zong C, Li X, Wang A, Li Y, Yang T, Bao Q, Dubow M, Yang M, Rodrigo LA, Mao C. New Insights for Biosensing: Lessons from Microbial Defense Systems. Chem Rev 2022; 122:8126-8180. [PMID: 35234463 DOI: 10.1021/acs.chemrev.1c01063] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 12/15/2022]
Abstract
Microorganisms have gained defense systems during the lengthy process of evolution over millions of years. Such defense systems can protect them from being attacked by invading species (e.g., CRISPR-Cas for establishing adaptive immune systems and nanopore-forming toxins as virulence factors) or enable them to adapt to different conditions (e.g., gas vesicles for achieving buoyancy control). These microorganism defense systems (MDS) have inspired the development of biosensors that have received much attention in a wide range of fields including life science research, food safety, and medical diagnosis. This Review comprehensively analyzes biosensing platforms originating from MDS for sensing and imaging biological analytes. We first describe a basic overview of MDS and MDS-inspired biosensing platforms (e.g., CRISPR-Cas systems, nanopore-forming proteins, and gas vesicles), followed by a critical discussion of their functions and properties. We then discuss several transduction mechanisms (optical, acoustic, magnetic, and electrical) involved in MDS-inspired biosensing. We further detail the applications of the MDS-inspired biosensors to detect a variety of analytes (nucleic acids, peptides, proteins, pathogens, cells, small molecules, and metal ions). In the end, we propose the key challenges and future perspectives in seeking new and improved MDS tools that can potentially lead to breakthrough discoveries in developing a new generation of biosensors with a combination of low cost; high sensitivity, accuracy, and precision; and fast detection. Overall, this Review gives a historical review of MDS, elucidates the principles of emulating MDS to develop biosensors, and analyzes the recent advancements, current challenges, and future trends in this field. 
It provides a unique critical analysis of emulating MDS to develop robust biosensors and discusses the design of such biosensors using elements found in MDS, showing that emulating MDS is a promising approach to conceptually advancing the design of biosensors.
Affiliation(s)
- Yi Wan
- State Key Laboratory of Marine Resource Utilization in the South China Sea, School of Pharmaceutical Sciences, Marine College, Hainan University, Haikou 570228, P. R. China
- Chengli Zong
- State Key Laboratory of Marine Resource Utilization in the South China Sea, School of Pharmaceutical Sciences, Marine College, Hainan University, Haikou 570228, P. R. China
- Xiangpeng Li
- Department of Bioengineering and Therapeutic Sciences, Schools of Medicine and Pharmacy, University of California, San Francisco, 1700 Fourth Street, Byers Hall 303C, San Francisco, California 94158, United States
- Aimin Wang
- State Key Laboratory of Marine Resource Utilization in the South China Sea, School of Pharmaceutical Sciences, Marine College, Hainan University, Haikou 570228, P. R. China
- Yan Li
- College of Animal Science, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Tao Yang
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Qing Bao
- School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Michael Dubow
- Institute for Integrative Biology of the Cell (I2BC), UMR 9198 CNRS, CEA, Université Paris-Saclay, Campus C.N.R.S, Bâtiment 12, Avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Mingying Yang
- College of Animal Science, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Ledesma-Amaro Rodrigo
- Imperial College Centre for Synthetic Biology, Department of Bioengineering, Imperial College London, London SW7 2AZ, United Kingdom
- Chuanbin Mao
- Department of Chemistry & Biochemistry, Stephenson Life Science Research Center, University of Oklahoma, 101 Stephenson Parkway, Norman, Oklahoma 73019, United States; School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
30
Diricks M, Kohl TA, Käding N, Leshchinskiy V, Hauswaldt S, Jiménez Vázquez O, Utpatel C, Niemann S, Rupp J, Merker M. Whole genome sequencing-based classification of human-related Haemophilus species and detection of antimicrobial resistance genes. Genome Med 2022; 14:13. [PMID: 35139905 PMCID: PMC8830169 DOI: 10.1186/s13073-022-01017-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/31/2021] [Accepted: 01/24/2022] [Indexed: 12/31/2022] Open
Abstract
Background Bacteria belonging to the genus Haemophilus cause a wide range of diseases in humans. Recently, H. influenzae was classified by the WHO as a priority pathogen due to the wide spread of ampicillin resistant strains. However, other Haemophilus spp. are often misclassified as H. influenzae. Therefore, we established an accurate and rapid whole genome sequencing (WGS) based classification and serotyping algorithm and combined it with the detection of resistance genes. Methods A gene presence/absence-based classification algorithm was developed, which employs the open-source gene-detection tool SRST2 and a new classification database comprising 36 genes, including capsule loci for serotyping. These genes were identified using a comparative genome analysis of 215 strains belonging to ten human-related Haemophilus (sub)species (training dataset). The algorithm was evaluated on 1329 public short read datasets (evaluation dataset) and used to reclassify 262 clinical Haemophilus spp. isolates from 250 patients (German cohort). In addition, the presence of antibiotic resistance genes within the German dataset was evaluated with SRST2 and correlated with results of traditional phenotyping assays. Results The newly developed algorithm can differentiate between clinically relevant Haemophilus species including, but not limited to, H. influenzae, H. haemolyticus, and H. parainfluenzae. It can also identify putative haemin-independent H. haemolyticus strains and determine the serotype of typeable Haemophilus strains. The algorithm performed excellently in the evaluation dataset (99.6% concordance with reported species classification and 99.5% with reported serotype) and revealed several misclassifications. Additionally, 83 out of 262 (31.7%) suspected H. influenzae strains from the German cohort were in fact H. haemolyticus strains, some of which were associated with mouth abscesses and lower respiratory tract infections. 
Resistance genes were detected in 16 out of 262 datasets from the German cohort. Prediction of ampicillin resistance, associated with blaTEM-1D, and tetracycline resistance, associated with tetB, correlated well with available phenotypic data. Conclusions Our new classification database and algorithm have the potential to improve diagnosis and surveillance of Haemophilus spp. and can easily be coupled with other public genotyping and antimicrobial resistance databases. Our data also point towards a possible pathogenic role of H. haemolyticus strains, which needs to be further investigated. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-022-01017-x.
Affiliation(s)
- Margo Diricks
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany; German Center for Infection Research (DZIF), Partner Site Hamburg-Lübeck-Borstel-Riems, Hamburg, Germany
- Thomas A Kohl
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany; German Center for Infection Research (DZIF), Partner Site Hamburg-Lübeck-Borstel-Riems, Hamburg, Germany
- Nadja Käding
- Department of Infectious Diseases and Microbiology, University Hospital Schleswig-Holstein, Lübeck, Germany; German Center for Infection Research (DZIF), TTU HAARBI, Lübeck, Germany
- Vladislav Leshchinskiy
- Department of Infectious Diseases and Microbiology, University Hospital Schleswig-Holstein, Lübeck, Germany
- Susanne Hauswaldt
- Department of Infectious Diseases and Microbiology, University Hospital Schleswig-Holstein, Lübeck, Germany
- Omar Jiménez Vázquez
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany
- Christian Utpatel
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany; German Center for Infection Research (DZIF), Partner Site Hamburg-Lübeck-Borstel-Riems, Hamburg, Germany
- Stefan Niemann
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany; German Center for Infection Research (DZIF), Partner Site Hamburg-Lübeck-Borstel-Riems, Hamburg, Germany
- Jan Rupp
- Department of Infectious Diseases and Microbiology, University Hospital Schleswig-Holstein, Lübeck, Germany; German Center for Infection Research (DZIF), TTU HAARBI, Lübeck, Germany
- Matthias Merker
- Molecular and Experimental Mycobacteriology, Research Center Borstel, Borstel, Germany; German Center for Infection Research (DZIF), Partner Site Hamburg-Lübeck-Borstel-Riems, Hamburg, Germany; Evolution of the Resistome, Research Center Borstel, Borstel, Germany
31
Liu J, Shen Q, Bao H. Comparison of seven SNP calling pipelines for the next-generation sequencing data of chickens. PLoS One 2022; 17:e0262574. [PMID: 35100292 PMCID: PMC8803190 DOI: 10.1371/journal.pone.0262574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 12/29/2021] [Indexed: 11/18/2022] Open
Abstract
Single nucleotide polymorphisms (SNPs) are widely used in genome-wide association studies and population genetics analyses. Next-generation sequencing (NGS) has become convenient, and many SNP-calling pipelines have been developed for human NGS data. However, there is a knowledge gap in selecting the appropriate SNP calling pipeline to handle high-throughput NGS data. To fill this gap, we studied and compared seven SNP calling pipelines, which include 16GT, genome analysis toolkit (GATK), Bcftools-single (Bcftools single sample mode), Bcftools-multiple (Bcftools multiple sample mode), VarScan2-single (VarScan2 single sample mode), VarScan2-multiple (VarScan2 multiple sample mode) and Freebayes pipelines, using 96 NGS datasets with depth gradients of approximately 5X, 10X, 20X, 30X, 40X, and 50X coverage from 16 Rhode Island Red chickens. The sixteen chickens were also genotyped with a 50K SNP array, and the sensitivity and specificity of each pipeline were assessed by comparison to the results of SNP arrays. For each pipeline, except Freebayes, the number of detected SNPs increased as the input read depth increased. In comparison with other pipelines, 16GT, followed by Bcftools-multiple, obtained the most SNPs when the input coverage exceeded 10X, and Bcftools-multiple obtained the most when the input was 5X and 10X. The sensitivity and specificity of each pipeline increased with increasing input. Bcftools-multiple had the highest sensitivity numerically when the input ranged from 5X to 30X, and 16GT showed the highest sensitivity when the input was 40X and 50X. Bcftools-multiple also had the highest specificity, followed by GATK, at almost all input levels. For most calling pipelines, there were no obvious changes in SNP numbers, sensitivities or specificities beyond 20X. 
In conclusion, (1) if only SNPs were detected, the sequencing depth did not need to exceed 20X; (2) the Bcftools-multiple may be the best choice for detecting SNPs from chicken NGS data, but for a single sample or sequencing depth greater than 20X, 16GT was recommended. Our findings provide a reference for researchers to select suitable pipelines to obtain SNPs from the NGS data of chickens or nonhuman animals.
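The array-based evaluation described above can be illustrated with a small Python sketch of sensitivity and specificity. The abstract does not give the paper's exact definitions, so this assumes the common ones: array genotypes are treated as truth at array sites only, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function name and the dictionary layout are hypothetical.

```python
def sensitivity_specificity(array_is_variant, pipeline_is_variant):
    """Compare pipeline SNP calls with array genotypes (illustrative only).

    Both arguments map a site identifier to True (variant) or False
    (reference). Sites missing from the pipeline output count as not called.
    Returns (sensitivity, specificity).
    """
    tp = fn = tn = fp = 0
    for site, truth in array_is_variant.items():
        called = pipeline_is_variant.get(site, False)
        if truth and called:
            tp += 1
        elif truth and not called:
            fn += 1
        elif not truth and called:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Running this once per pipeline and per depth level (5X to 50X) would give the kind of table from which the plateau beyond 20X could be read off.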
Affiliation(s)
- Jing Liu
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Qingmiao Shen
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Haigang Bao
- National Engineering Laboratory for Animal Breeding, Beijing Key Laboratory for Animal Genetic Improvement, College of Animal Science and Technology, China Agricultural University, Beijing, China
32
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open
Abstract
Background Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no complete genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. Results In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in a big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of read pair merging and SSR mining from very large-scale DNA sequence data. Conclusions The excellent performance of BigFiRSt mainly stems from using Hadoop big data technology to merge read pairs and mine SSRs in parallel through distributed computing on clusters. 
We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
Affiliation(s)
- Jinxiang Chen
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Fuyi Li
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Miao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Junlong Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Tatiana T. Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Jerico Revote
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Shuqin Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
  - *Correspondence: Jiangning Song
|
33
|
Suminda GGD, Bhandari S, Won Y, Goutam U, Kanth Pulicherla K, Son YO, Ghosh M. High-throughput sequencing technologies in the detection of livestock pathogens, diagnosis, and zoonotic surveillance. Comput Struct Biotechnol J 2022; 20:5378-5392. [PMID: 36212529 PMCID: PMC9526013 DOI: 10.1016/j.csbj.2022.09.028] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/19/2022] [Revised: 09/20/2022] [Accepted: 09/21/2022] [Indexed: 12/03/2022] Open
Abstract
Increasing globalization, agricultural intensification, urbanization, and climatic changes have resulted in a significant recent increase in emerging infectious zoonotic diseases. Zoonotic diseases are becoming more common, so innovative, effective, and integrative research is required to better understand their transmission, ecological implications, and dynamics at wildlife-human interfaces. High-throughput sequencing (HTS) methodologies have enormous potential for unraveling these contingencies and improving our understanding, but they are only now beginning to be realized in livestock research. This study investigates the current state of use of sequencing technologies in the detection of livestock pathogens such as bovine, dogs (Canis lupus familiaris), sheep (Ovis aries), pigs (Sus scrofa), horses (Equus caballus), chicken (Gallus gallus domesticus), and ducks (Anatidae) as well as how it can improve the monitoring and detection of zoonotic infections. We also described several high-throughput sequencing approaches for improved detection of known, unknown, and emerging infectious agents, resulting in better infectious disease diagnosis, as well as surveillance of zoonotic infectious diseases. In the coming years, the continued advancement of sequencing technologies will improve livestock research and hasten the development of various new genomic and technological studies on farm animals.
|
34
|
Single-Cell Transcriptome Profiling Simulation Reveals the Impact of Sequencing Parameters and Algorithms on Clustering. Life (Basel) 2021; 11:life11070716. [PMID: 34357088 PMCID: PMC8304014 DOI: 10.3390/life11070716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2021] [Revised: 07/09/2021] [Accepted: 07/15/2021] [Indexed: 11/16/2022] Open
Abstract
Despite the scRNA-seq analytic algorithms developed, their performance for cell clustering cannot be quantified due to the unknown "true" clusters. Referencing the transcriptomic heterogeneity of cell clusters, a "true" mRNA number matrix of cell individuals was defined as ground truth. Based on the matrix and the actual data generation procedure, a simulation program (SSCRNA) for raw data was developed. Subsequently, the consistency between simulated data and real data was evaluated. Furthermore, the impact of sequencing depth and algorithms for analyses on cluster accuracy was quantified. As a result, the simulation result was highly consistent with that of the actual data. Among the clustering algorithms, the Gaussian normalization method was the more recommended. As for the clustering algorithms, the K-means clustering method was more stable than K-means plus Louvain clustering. In conclusion, the scRNA simulation algorithm developed restores the actual data generation process, discovers the impact of parameters on classification, compares the normalization/clustering algorithms, and provides novel insight into scRNA analyses.
|
35
|
Seaby EG, Ennis S. Challenges in the diagnosis and discovery of rare genetic disorders using contemporary sequencing technologies. Brief Funct Genomics 2021; 19:243-258. [PMID: 32393978 DOI: 10.1093/bfgp/elaa009] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 12/13/2022] Open
Abstract
Next generation sequencing (NGS) has revolutionised rare disease diagnostics. Concomitant with advancing technologies has been a rise in the number of new gene disorders discovered and diagnoses made for patients and their families. However, despite the trend towards whole exome and whole genome sequencing, diagnostic rates remain suboptimal. On average, only ~30% of patients receive a molecular diagnosis. National sequencing projects launched in the last 5 years are integrating clinical diagnostic testing with research avenues to widen the spectrum of known genetic disorders. Consequently, efforts to diagnose genetic disorders in a clinical setting are now often shared with efforts to prioritise candidate variants for the detection of new disease genes. Herein we discuss some of the biggest obstacles precluding molecular diagnosis and discovery of new gene disorders. We consider bioinformatic and analytical challenges faced when interpreting next generation sequencing data and showcase some of the newest tools available to mitigate these issues. We consider how incomplete penetrance, non-coding variation and structural variants are likely to impact diagnostic rates, and we further discuss methods for uplifting novel gene discovery by adopting a gene-to-patient-based approach.
|
36
|
Affiliation(s)
- Matthew S Lebo
  - Bioinformatics and Laboratory of Molecular Medicine, Partners Personalized Medicine, 65 Landsdowne Street, Cambridge, MA 02139, USA; Pathology, Harvard Medical School, 25 Shattuck Street, Boston, MA 02115, USA; Pathology, Brigham and Women's Hospital, 75 Francis Street, Boston, MA 02115, USA
- Limin Hao
  - Bioinformatics and Laboratory of Molecular Medicine, Partners Personalized Medicine, 65 Landsdowne Street, Cambridge, MA 02139, USA
- Chiao-Feng Lin
  - Bioinformatics and Laboratory of Molecular Medicine, Partners Personalized Medicine, 65 Landsdowne Street, Cambridge, MA 02139, USA
- Arti Singh
  - Bioinformatics and Laboratory of Molecular Medicine, Partners Personalized Medicine, 65 Landsdowne Street, Cambridge, MA 02139, USA
|
37
|
Kühl MA, Stich B, Ries DC. Mutation-Simulator: fine-grained simulation of random mutations in any genome. Bioinformatics 2021; 37:568-569. [PMID: 32780803 PMCID: PMC8088320 DOI: 10.1093/bioinformatics/btaa716] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/18/2020] [Revised: 06/12/2020] [Accepted: 08/05/2020] [Indexed: 01/11/2023] Open
Abstract
Summary Mutation-Simulator allows the introduction of various types of sequence alterations in reference sequences, with reasonable compute-time even for large eukaryotic genomes. Its intuitive system for fine-grained control over mutation rates along the sequence enables the mimicking of natural mutation patterns. Using standard file formats for input and output data, it can easily be integrated into any development and benchmarking workflow for high-throughput sequencing applications. Availability and implementation Mutation-Simulator is written in Python 3 and the source code, documentation, help and use cases are available on the Github page at https://github.com/mkpython3/Mutation-Simulator. It is free for use under the GPL 3 license.
Affiliation(s)
- M A Kühl
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- B Stich
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
- D C Ries
  - Quantitative Genetics and Genomics of Plants, Heinrich Heine University, Düsseldorf 40225, Germany
|
38
|
Ono Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics 2021; 37:589-595. [PMID: 32976553 PMCID: PMC8097687 DOI: 10.1093/bioinformatics/btaa835] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 06/22/2020] [Revised: 08/20/2020] [Accepted: 09/11/2020] [Indexed: 12/21/2022] Open
Abstract
Motivation Recent advances in high-throughput long-read sequencers, such as PacBio and Oxford Nanopore sequencers, produce longer reads with more errors than short-read sequencers. In addition to the high error rates of reads, non-uniformity of errors leads to difficulties in various downstream analyses using long reads. Many useful simulators, which characterize long-read error patterns and simulate them, have been developed. However, there is still room for improvement in the simulation of the non-uniformity of errors. Results To capture characteristics of errors in reads for long-read sequencers, here, we introduce a generative model for quality scores, in which a hidden Markov Model with a latest model selection method, called factorized information criteria, is utilized. We evaluated our developed simulator from various points, indicating that our simulator successfully simulates reads that are consistent with real reads. Availability and implementation The source codes of PBSIM2 are freely available from https://github.com/yukiteruono/pbsim2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yukiteru Ono
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8561, Japan
- Kiyoshi Asai
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8561, Japan
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
  - Institute for Medical-oriented Structural Biology, Waseda University, Tokyo 162-8480, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
|
39
|
Bogaerts B, Delcourt T, Soetaert K, Boarbi S, Ceyssens PJ, Winand R, Van Braekel J, De Keersmaecker SCJ, Roosens NHC, Marchal K, Mathys V, Vanneste K. A Bioinformatics Whole-Genome Sequencing Workflow for Clinical Mycobacterium tuberculosis Complex Isolate Analysis, Validated Using a Reference Collection Extensively Characterized with Conventional Methods and In Silico Approaches. J Clin Microbiol 2021; 59:e00202-21. [PMID: 33789960 PMCID: PMC8316078 DOI: 10.1128/jcm.00202-21] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/27/2021] [Accepted: 03/27/2021] [Indexed: 01/18/2023] Open
Abstract
The use of whole-genome sequencing (WGS) for routine typing of bacterial isolates has increased substantially in recent years. For Mycobacterium tuberculosis (MTB), in particular, WGS has the benefit of drastically reducing the time required to generate results compared to most conventional phenotypic methods. Consequently, a multitude of solutions for analyzing WGS MTB data have been developed, but their successful integration in clinical and national reference laboratories is hindered by the requirement for their validation, for which a consensus framework is still largely absent. We developed a bioinformatics workflow for (Illumina) WGS-based routine typing of MTB complex (MTBC) member isolates allowing complete characterization, including (sub)species confirmation and identification (16S, csb/RD, hsp65), single nucleotide polymorphism (SNP)-based antimicrobial resistance (AMR) prediction, and pathogen typing (spoligotyping, SNP barcoding, and core genome multilocus sequence typing). Workflow performance was validated on a per-assay basis using a collection of 238 in-house-sequenced MTBC isolates, extensively characterized with conventional molecular biology-based approaches supplemented with public data. For SNP-based AMR prediction, results from molecular genotyping methods were supplemented with in silico modified data sets, allowing us to greatly increase the set of evaluated mutations. The workflow demonstrated very high performance with performance metrics of >99% for all assays, except for spoligotyping, where sensitivity dropped to ∼90%. The validation framework for our WGS-based bioinformatics workflow can aid in the standardization of bioinformatics tools by the MTB community and other SNP-based applications regardless of the targeted pathogen(s). The bioinformatics workflow is available for academic and nonprofit use through the Galaxy instance of our institute at https://galaxy.sciensano.be.
Affiliation(s)
- Bert Bogaerts
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Thomas Delcourt
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Raf Winand
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Julien Van Braekel
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Nancy H C Roosens
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
- Kathleen Marchal
  - Department of Information Technology, Internet Technology and Data Science Lab (IDLab), Interuniversity Microelectronics Centre (IMEC), Ghent University, Ghent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
  - Department of Genetics, University of Pretoria, Pretoria, South Africa
- Kevin Vanneste
  - Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
|
40
|
Herzig AF, Velo-Suárez L, Le Folgoc G, Boland A, Blanché H, Olaso R, Le Roux L, Delmas C, Goldberg M, Zins M, Lethimonnier F, Deleuze JF, Génin E. Evaluation of saliva as a source of accurate whole-genome and microbiome sequencing data. Genet Epidemiol 2021; 45:537-548. [PMID: 33998042 DOI: 10.1002/gepi.22386] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/26/2020] [Revised: 04/27/2021] [Accepted: 04/27/2021] [Indexed: 11/08/2022]
Abstract
This study sets out to establish the suitability of saliva-based whole-genome sequencing (WGS) through a comparison against blood-based WGS. To fully appraise the observed differences, we developed a novel technique of pseudo-replication. We also investigated the potential of characterizing individual salivary microbiomes from non-human DNA fragments found in saliva. We observed that the majority of discordant genotype calls between blood and saliva fell into known regions of the human genome that are typically sequenced with low confidence; and could be identified by quality control measures. Pseudo-replication demonstrated that the levels of discordance between blood- and saliva-derived WGS data were entirely similar to what one would expect between technical replicates if an individual's blood or saliva had been sequenced twice. Finally, we successfully sequenced salivary microbiomes in parallel to human genomes as demonstrated by a comparison against the Human Microbiome Project.
Affiliation(s)
- Lourdes Velo-Suárez
  - Univ Brest, EFS, UMR 1078, GGB, Inserm, Brest, France
  - Brest Center for Microbiota Analysis (CBAM), CHU Brest, Brest, France
- Anne Boland
  - National Center for Research in Human Genomics (CNRGH), François Jacob Institute of Biology, CEA, Paris-Saclay University, Evry, France
  - Laboratory of Excellence GENMED (Medical Genomics), Paris, France
- Hélène Blanché
  - Laboratory of Excellence GENMED (Medical Genomics), Paris, France
  - Fondation Jean Dausset-CEPH, Paris, France
- Robert Olaso
  - National Center for Research in Human Genomics (CNRGH), François Jacob Institute of Biology, CEA, Paris-Saclay University, Evry, France
  - Laboratory of Excellence GENMED (Medical Genomics), Paris, France
- Liana Le Roux
  - Clinical Investigation Center 1412, Inserm, CHU Brest, Brest, France
- Marcel Goldberg
  - Inserm-Paris Saclay University, University of Paris, Villejuif, France
- Marie Zins
  - Inserm-Paris Saclay University, University of Paris, Villejuif, France
- Franck Lethimonnier
  - National Alliance for Life and Health Sciences (Aviesan), Multiorganism thematic institute, Health technologies, INSERM, Paris, France
- Jean-François Deleuze
  - National Center for Research in Human Genomics (CNRGH), François Jacob Institute of Biology, CEA, Paris-Saclay University, Evry, France
  - Laboratory of Excellence GENMED (Medical Genomics), Paris, France
  - Fondation Jean Dausset-CEPH, Paris, France
  - Center of Reference, Innovation and Expertize (CREFIX), US39, French Atomic Energy and Alternative Energies Commission, Evry, France
- Emmanuelle Génin
  - Univ Brest, EFS, UMR 1078, GGB, Inserm, Brest, France
  - CHU Brest, Brest, France
|
41
|
Chua PYS, Crampton-Platt A, Lammers Y, Alsos IG, Boessenkool S, Bohmann K. Metagenomics: A viable tool for reconstructing herbivore diet. Mol Ecol Resour 2021; 21:2249-2263. [PMID: 33971086 PMCID: PMC8518049 DOI: 10.1111/1755-0998.13425] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/13/2020] [Revised: 04/08/2021] [Accepted: 05/04/2021] [Indexed: 11/28/2022]
Abstract
Metagenomics can generate data on the diet of herbivores, without the need for primer selection and PCR enrichment steps as is necessary in metabarcoding. Metagenomic approaches to diet analysis have remained relatively unexplored, requiring validation of bioinformatic steps. Currently, no metagenomic herbivore diet studies have utilized both chloroplast and nuclear markers as reference sequences for plant identification, which would increase the number of reads that could be taxonomically informative. Here, we explore how in silico simulation of metagenomic data sets resembling sequences obtained from faecal samples can be used to validate taxonomic assignment. Using a known list of sequences to create simulated data sets, we derived reliable identification parameters for taxonomic assignments of sequences. We applied these parameters to characterize the diet of western capercaillies (Tetrao urogallus) located in Norway, and compared the results with metabarcoding trnL P6 loop data generated from the same samples. Both methods performed similarly in the number of plant taxa identified (metagenomics 42 taxa, metabarcoding 43 taxa), with no significant difference in species resolution (metagenomics 24%, metabarcoding 23%). We further observed that while metagenomics was strongly affected by the age of faecal samples, with fresh samples outperforming old samples, metabarcoding was not affected by sample age. On the other hand, metagenomics allowed us to simultaneously obtain the mitochondrial genome of the western capercaillies, thereby providing additional ecological information. Our study demonstrates the potential of utilizing metagenomics for diet reconstruction but also highlights key considerations as compared to metabarcoding for future utilization of this technique.
Affiliation(s)
- Physilia Y S Chua
  - Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
  - Department of Biology, Faculty of Science, University of Copenhagen, Copenhagen, Denmark
- Youri Lammers
  - Tromsø Museum, UiT - The Arctic University of Norway, Tromsø, Norway
- Inger G Alsos
  - Tromsø Museum, UiT - The Arctic University of Norway, Tromsø, Norway
- Sanne Boessenkool
  - Centre for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Kristine Bohmann
  - Section for Evolutionary Genomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
|
42
|
Abstract
We describe the incorporation of gated ion channels into probes for scanning ion conductance microscopy (SICM) as a robust platform for collecting spatial information at interfaces. Specifically, a dual-barrel pipet is used, where one barrel controls the pipet position and the second barrel houses voltage-gated transient receptor potential vanilloid 1 (TRPV1) channels excised in a sniffer-patch configuration. Spatially resolved sensing with TRPV1 channels is demonstrated by imaging a porous membrane where a transmembrane potential across the membrane generates local electric field gradients at pores that activate TRPV1 channels when the probe is in the vicinity of the pore. The scanning routine and automated signal analysis demonstrated provide a generalizable approach to employing gated ion channels as sensors for imaging applications.
Affiliation(s)
- Cheng Zhu
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
- Kaixiang Huang
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
- Yunong Wang
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
- Kristen Alanis
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
- Wenqing Shi
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
- Lane A Baker
  - Department of Chemistry, Indiana University, 800 E. Kirkwood Avenue, Bloomington, Indiana 47405, United States
|
43
|
Ali MA. Phylotranscriptomic analysis of Dillenia indica L. (Dilleniales, Dilleniaceae) and its systematics implication. Saudi J Biol Sci 2021; 28:1557-1560. [PMID: 33732040 PMCID: PMC7938110 DOI: 10.1016/j.sjbs.2021.01.038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2020] [Revised: 01/17/2021] [Accepted: 01/18/2021] [Indexed: 11/21/2022] Open
Abstract
The recent massive development in the next-generation sequencing platforms and bioinformatics tools including cloud based computing have proven extremely useful in understanding the deeper-level phylogenetic relationships of angiosperms. The present phylotranscriptomic analyses address the poorly known evolutionary relationships of the order Dilleniales to order of the other angiosperms using the minimum evolution method. The analyses revealed the nesting of the representative taxon of Dilleniales in the MPT but distinct from the representative of the order Santalales, Caryophyllales, Asterales, Cornales, Ericales, Lamiales, Saxifragales, Fabales, Malvales, Vitales and Berberidopsidales.
Affiliation(s)
- Mohammad Ajmal Ali
  - Department of Botany and Microbiology, College of Science, King Saud University, Riyadh 11451, Saudi Arabia
|
44
|
Li Z, Fang S, Zhang R, Yu L, Zhang J, Bu D, Sun L, Zhao Y, Li J. VarBen. J Mol Diagn 2021; 23:285-299. [DOI: 10.1016/j.jmoldx.2020.11.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/20/2019] [Revised: 10/06/2020] [Accepted: 11/17/2020] [Indexed: 02/08/2023] Open
|
45
|
Schmeing S, Robinson MD. ReSeq simulates realistic Illumina high-throughput sequencing data. Genome Biol 2021; 22:67. [PMID: 33608040 PMCID: PMC7896392 DOI: 10.1186/s13059-021-02265-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/17/2020] [Accepted: 01/07/2021] [Indexed: 12/18/2022] Open
Abstract
In high-throughput sequencing data, performance comparisons between computational tools are essential for making informed decisions at each step of a project. Simulations are a critical part of method comparisons, but for standard Illumina sequencing of genomic DNA, they are often oversimplified, which leads to optimistic results for most tools. ReSeq improves the authenticity of synthetic data by extracting and reproducing key components from real data. Major advancements are the inclusion of systematic errors, a fragment-based coverage model and sampling-matrix estimates based on two-dimensional margins. These improvements lead to more faithful performance evaluations. ReSeq is available at https://github.com/schmeing/ReSeq.
Affiliation(s)
- Stephan Schmeing
  - Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland
  - SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland
- Mark D Robinson
  - Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland
  - SIB Swiss Institute of Bioinformatics, Winterthurerstrasse 190, Zurich, 8057, Switzerland
|
46
|
Richmond PA, Av‐Shalom TV, Fornes O, Modi B, Elliott AM, Wasserman WW. GeneBreaker: Variant simulation to improve the diagnosis of Mendelian rare genetic diseases. Hum Mutat 2021; 42:346-358. [PMID: 33368787 PMCID: PMC8247879 DOI: 10.1002/humu.24163] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/15/2020] [Revised: 11/06/2020] [Accepted: 12/14/2020] [Indexed: 12/21/2022]
Abstract
Mendelian rare genetic diseases affect 5%–10% of the population, and with over 5300 genes responsible for ∼7000 different diseases, they are challenging to diagnose. The use of whole‐genome sequencing (WGS) has bolstered the diagnosis rate significantly. The effective use of WGS relies on the ability to identify the disrupted gene responsible for disease phenotypes. This process involves genomic variant calling and prioritization, and is the beneficiary of improvements to sequencing technology, variant calling approaches, and increased capacity to prioritize genomic variants with potential pathogenicity. As analysis pipelines continue to improve, careful testing of their efficacy is paramount. However, real‐life cases typically emerge anecdotally, and utilization of clinically sensitive and identifiable data for testing pipeline improvements is regulated and limiting. We identified the need for a gene‐based variant simulation framework that can create mock rare disease scenarios, utilizing known pathogenic variants or through the creation of novel gene‐disrupting variants. To fill this need, we present GeneBreaker, a tool that creates synthetic rare disease cases with utility for benchmarking variant calling approaches, testing the efficacy of variant prioritization, and as an educational mechanism for training diagnostic practitioners in the expanding field of genomic medicine. GeneBreaker is freely available at http://GeneBreaker.cmmt.ubc.ca.
Affiliation(s)
- Phillip A. Richmond
  - Department of Medical Genetics, Center for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Tamar V. Av‐Shalom
  - Department of Medical Genetics, Center for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Oriol Fornes
  - Department of Medical Genetics, Center for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Bhavi Modi
  - Department of Medical Genetics, Center for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Alison M. Elliott
  - Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
- Wyeth W. Wasserman
  - Department of Medical Genetics, Center for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
|
47
|
Petrillo M, Fabbri M, Kagkli DM, Querci M, Van den Eede G, Alm E, Aytan-Aktug D, Capella-Gutierrez S, Carrillo C, Cestaro A, Chan KG, Coque T, Endrullat C, Gut I, Hammer P, Kay GL, Madec JY, Mather AE, McHardy AC, Naas T, Paracchini V, Peter S, Pightling A, Raffael B, Rossen J, Ruppé E, Schlaberg R, Vanneste K, Weber LM, Westh H, Angers-Loustau A. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing. F1000Res 2021; 10:80. [DOI: 10.12688/f1000research.39214.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 02/02/2021] [Indexed: 01/12/2023] Open
Abstract
Next Generation Sequencing technologies significantly impact the field of Antimicrobial Resistance (AMR) detection and monitoring, with immediate uses in diagnosis and risk assessment. For this application and in general, considerable challenges remain in demonstrating sufficient trust to act upon the meaningful information produced from raw data, partly because of the reliance on bioinformatics pipelines, which can produce different results and therefore lead to different interpretations. With the constant evolution of the field, it is difficult to identify, harmonise and recommend specific methods for large-scale implementations over time. In this article, we propose to address this challenge through establishing a transparent, performance-based, evaluation approach to provide flexibility in the bioinformatics tools of choice, while demonstrating proficiency in meeting common performance standards. The approach is two-fold: first, a community-driven effort to establish and maintain “live” (dynamic) benchmarking platforms to provide relevant performance metrics, based on different use-cases, that would evolve together with the AMR field; second, agreed and defined datasets to allow the pipelines’ implementation, validation, and quality-control over time. Following previous discussions on the main challenges linked to this approach, we provide concrete recommendations and future steps, related to different aspects of the design of benchmarks, such as the selection and the characteristics of the datasets (quality, choice of pathogens and resistances, etc.), the evaluation criteria of the pipelines, and the way these resources should be deployed in the community.
|
48
|
Nodehi HM, Tabatabaiefar MA, Sehhati M. Selection of Optimal Bioinformatic Tools and Proper Reference for Reducing the Alignment Error in Targeted Sequencing Data. JOURNAL OF MEDICAL SIGNALS & SENSORS 2021; 11:37-44. [PMID: 34026589 PMCID: PMC8043119 DOI: 10.4103/jmss.jmss_7_20] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2020] [Revised: 01/28/2020] [Accepted: 02/12/2020] [Indexed: 11/04/2022]
Abstract
Background Careful design in the primary steps of a next-generation sequencing study is critical for obtaining successful results in downstream analysis. Methods In this study, a framework is proposed to evaluate and improve sequence mapping in targeted regions of the reference genome. Simulated short reads were produced from the coding regions of the human genome and mapped to a Customized Target-Based Reference (CTBR) using recently introduced alignment tools. Short reads simulating different sequencing technologies, with and without well-defined mutation types, were aligned to both the standard genome and the CTBR, and the number of unmapped and misaligned reads and the runtime were measured for comparison. Results Mapping accuracy was significantly better for reads generated from Illumina HiSeq 2500 and aligned with Stampy against the CTBR than for the other evaluated pipelines. Using the CTBR for alignment significantly decreased the mapping error compared with more expanded or more limited references. When intentional mutations were introduced into the reads, Stampy showed the minimum error of 1.67% using the CTBR, whereas the lowest errors obtained by Stampy using the whole genome and one chromosome as references were 3.78% and 20%, respectively. Maximum and minimum misalignment errors were observed on chromosomes Y and 20, respectively. Conclusion Using the proposed framework in a clinical targeted sequencing study may therefore help predict the error and improve the performance of variant calling in the genomic regions targeted by the study.
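The error measurement described in this abstract can be illustrated with a small sketch. Because simulated reads carry their true origin, each read can be classified as unmapped, misaligned, or correctly mapped. The abstract does not give the authors' exact error definition, so the `ReadResult` record, the `mapping_error` function, and the position `tolerance` below are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ReadResult:
    """One simulated read: where it came from and where the aligner put it."""
    true_chrom: str
    true_pos: int
    mapped_chrom: Optional[str]  # None means the aligner left the read unmapped
    mapped_pos: Optional[int]


def mapping_error(reads: Sequence[ReadResult],
                  tolerance: int = 5) -> Tuple[float, float, float]:
    """Return (unmapped, misaligned, total error) as fractions of all reads.

    A mapped read counts as misaligned when it lands on the wrong chromosome
    or more than `tolerance` bases from its true start (an assumed criterion).
    """
    if not reads:
        raise ValueError("at least one read is required")
    unmapped = 0
    misaligned = 0
    for read in reads:
        if read.mapped_chrom is None or read.mapped_pos is None:
            unmapped += 1
        elif (read.mapped_chrom != read.true_chrom
              or abs(read.mapped_pos - read.true_pos) > tolerance):
            misaligned += 1
    total = len(reads)
    return unmapped / total, misaligned / total, (unmapped + misaligned) / total


# Toy input: one correct read, one unmapped, one on the wrong chromosome,
# and one shifted by 3 bases (inside the assumed tolerance).
example_reads = [
    ReadResult("chr20", 1000, "chr20", 1000),
    ReadResult("chr20", 2000, None, None),
    ReadResult("chrY", 3000, "chrX", 3000),
    ReadResult("chr1", 4000, "chr1", 4003),
]
unmapped_rate, misaligned_rate, error_rate = mapping_error(example_reads)
```

Running the same computation once per reference (whole genome, single chromosome, CTBR) would give the kind of per-reference comparison the abstract reports, under the stated assumptions.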
Affiliation(s)
- Hannane Mohammadi Nodehi
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammad Amin Tabatabaiefar
- Department of Medical Genetics, School of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Department of Bioinformatics, Medical Image and Signal Processing Research Center, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammadreza Sehhati
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
|
49
|
Chen W, Zhang P, Song L, Yang J, Han C. Simulation of Nanopore Sequencing Signals Based on BiGRU. SENSORS 2020; 20:s20247244. [PMID: 33348876 PMCID: PMC7766754 DOI: 10.3390/s20247244] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/05/2020] [Revised: 12/14/2020] [Accepted: 12/15/2020] [Indexed: 01/02/2023]
Abstract
Oxford Nanopore sequencing is an important sequencing technology that reads a nucleotide sequence by detecting changes in the electrical current signal as a DNA molecule is forced through a biological nanopore. Research on signal simulation for nanopore sequencing is highly desirable for method development in nanopore sequencing applications. To improve simulation accuracy, we propose a novel signal simulation method based on Bi-directional Gated Recurrent Units (BiGRU). In this method, a BiGRU-based signal processing model replaces the traditional low-pass filter to post-process the ground-truth signal calculated from the input nucleotide sequence and the nanopore sequencing pore model. Gaussian noise is then added to the filtered signal to generate the final simulated signal. Trained on experimental sequencing data, this method can accurately model the relation between the ground-truth signal and the real-world sequencing signal. The simulation results show that the proposed method, which exploits the learning ability of the neural network, generates simulated signals that are closer to real-world sequencing signals in the time and frequency domains than the existing simulation method.
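The three-stage pipeline named in this abstract (pore-model lookup, filtering, Gaussian noise) can be sketched as follows. This is a minimal illustration of the traditional low-pass variant that the authors say their BiGRU replaces, not their method. The 3-mer pore model with random current levels, the exponential-smoothing filter, and all parameter values are assumptions invented for the example.

```python
import itertools
import random

K = 3  # k-mer length of the toy pore model (assumption)


def make_toy_pore_model(seed=0):
    """Map every k-mer to an invented mean current level in pA."""
    rng = random.Random(seed)
    return {"".join(kmer): rng.uniform(60.0, 120.0)
            for kmer in itertools.product("ACGT", repeat=K)}


def ground_truth_signal(sequence, pore_model, samples_per_kmer=8):
    """Piecewise-constant signal: each k-mer contributes a flat run of samples."""
    signal = []
    for i in range(len(sequence) - K + 1):
        signal.extend([pore_model[sequence[i:i + K]]] * samples_per_kmer)
    return signal


def low_pass(signal, alpha=0.3):
    """First-order low-pass filter (exponential smoothing) over the samples."""
    if not signal:
        return []
    filtered = []
    previous = signal[0]
    for sample in signal:
        previous = alpha * sample + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered


def add_gaussian_noise(signal, sigma=1.5, seed=1):
    """Add independent zero-mean Gaussian noise to every sample."""
    rng = random.Random(seed)
    return [sample + rng.gauss(0.0, sigma) for sample in signal]


def simulate(sequence, seed=0):
    """Pore-model lookup, then low-pass filtering, then additive noise."""
    clean = ground_truth_signal(sequence, make_toy_pore_model(seed))
    return add_gaussian_noise(low_pass(clean), seed=seed + 1)


simulated = simulate("ACGTACGT")
```

In the approach the abstract describes, the `low_pass` stage is the part a learned model would replace, while the lookup and noise stages keep the same roles.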
Affiliation(s)
- Weigang Chen
- School of Microelectronics, Tianjin University, Tianjin 300072, China
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin 300072, China
- Peng Zhang
- School of Microelectronics, Tianjin University, Tianjin 300072, China
- Lifu Song
- Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin 300072, China
- School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Jinsheng Yang
- School of Microelectronics, Tianjin University, Tianjin 300072, China
- Changcai Han
- School of Microelectronics, Tianjin University, Tianjin 300072, China
|
50
|
Subkhankulova T, Naumenko F, Tolmachov OE, Orlov YL. Novel ChIP-seq simulating program with superior versatility: isChIP. Brief Bioinform 2020; 22:6035271. [PMID: 33320934 DOI: 10.1093/bib/bbaa352] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/11/2020] [Revised: 10/18/2020] [Accepted: 11/03/2020] [Indexed: 12/13/2022]
Abstract
Chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) is recognized as an extremely powerful tool to study the interaction of numerous transcription factors and other chromatin-associated proteins with DNA. The core problem in optimizing the ChIP-seq protocol and the subsequent computational data analysis is that the 'true' pattern of binding events for a given protein factor is unknown. Computer simulation of the ChIP-seq process based on an 'a priori known binding template' can drastically reduce the number of wet lab experiments and ultimately help achieve radical optimization of the entire processing pipeline. We present a newly developed ChIP-seq simulation algorithm implemented in the novel software in silico ChIP-seq (isChIP). We demonstrate that isChIP closely approximates real ChIP-seq protocols and is able to model data similar to those obtained from experimental sequencing. We validated isChIP using publicly available datasets generated for the well-characterized transcription factors Oct4 and Sox2. Although the software is compatible with Illumina protocols by default, it can also perform simulations for a number of alternative sequencing platforms, such as Roche 454, Ion Torrent and SOLiD, and can model ChIP-exo. The versatility of isChIP was demonstrated by modelling a wide range of binding events, including those of transcription factors and chromatin modifiers. We also performed a comparative analysis against a few existing ChIP-seq simulators and showed the fundamental superiority of our model. Because it can utilize known binding templates, isChIP can potentially help investigators choose the most appropriate analytical software by benchmarking available ChIP-seq programs, and optimize the experimental parameters of the ChIP-seq protocol. isChIP software is freely available at https://github.com/fnaumenko/isChIP.
Affiliation(s)
- Yuriy L Orlov
- Digital Health Institute, I.M. Sechenov First Moscow State Medical University (Sechenov University), and Senior Scientist at Agrarian and Technological Institute, Peoples' Friendship University of Russia (RUDN University), Russia
|