1
|
Jugas R, Vitkova H. ProcaryaSV: structural variation detection pipeline for bacterial genomes using short-read sequencing. BMC Bioinformatics 2024; 25:233. [PMID: 38982375 PMCID: PMC11234778 DOI: 10.1186/s12859-024-05843-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Accepted: 06/13/2024] [Indexed: 07/11/2024] Open
Abstract
BACKGROUND Structural variations play an important role in bacterial genomes. They can mediate genome adaptation quickly in response to the external environment and thus can also play a role in antibiotic resistance. The detection of structural variations in bacteria is challenging, and the recognition of even small rearrangements can be important. Even though most detection tools are aimed at and benchmarked on eukaryotic genomes, they can also be used on prokaryotic genomes. The key features of detection are the ability to detect small rearrangements and support haploid genomes. Because of the limiting performance of a single detection tool, combining the detection abilities of multiple tools can lead to more robust results. There are already available workflows for structural variation detection for long-reads technologies and for the detection of single-nucleotide variation and indels, both aimed at bacteria. Yet we are unaware of structural variations detection workflows for the short-reads sequencing platform. Motivated by this gap we created our workflow. Further, we were interested in increasing the detection performance and providing more robust results. RESULTS We developed an open-source bioinformatics pipeline, ProcaryaSV, for the detection of structural variations in bacterial isolates from paired-end short sequencing reads. Multiple tools, starting with quality control and trimming of sequencing data, alignment to the reference genome, and multiple structural variation detection tools, are integrated. All the partial results are then processed and merged with an in-house merging algorithm. Compared with a single detection approach, ProcaryaSV has improved detection performance and is a reproducible easy-to-use tool. CONCLUSIONS The ProcaryaSV pipeline provides an integrative approach to structural variation detection from paired-end next-generation sequencing of bacterial samples. It can be easily installed and used on Linux machines. It is publicly available on GitHub at https://github.com/robinjugas/ProcaryaSV .
Collapse
Affiliation(s)
- Robin Jugas
- Department of Biomedical Engineering, Brno University of Technology, Brno, Czech Republic
| | - Helena Vitkova
- Department of Biomedical Engineering, Brno University of Technology, Brno, Czech Republic.
| |
Collapse
|
2
|
Schmidt M, Kutzner A. MSV: a modular structural variant caller that reveals nested and complex rearrangements by unifying breakends inferred directly from reads. Genome Biol 2023; 24:170. [PMID: 37461107 DOI: 10.1186/s13059-023-03009-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2021] [Accepted: 07/06/2023] [Indexed: 07/20/2023] Open
Abstract
Structural variant (SV) calling belongs to the standard tools of modern bioinformatics for identifying and describing alterations in genomes. Initially, this work presents several complex genomic rearrangements that reveal conceptual ambiguities inherent to the representation via basic SV. We contextualize these ambiguities theoretically as well as practically and propose a graph-based approach for resolving them. For various yeast genomes, we practically compute adjacency matrices of our graph model and demonstrate that they provide highly accurate descriptions of one genome in terms of another. An open-source prototype implementation of our approach is available under the MIT license at https://github.com/ITBE-Lab/MA .
Collapse
Affiliation(s)
- Markus Schmidt
- Biomedical Center Munich, Department of Physiological Chemistry, Ludwig-Maximilians-Universität, Großhaderner Str. 9, 82152, Planegg-Martinsried, Germany
| | - Arne Kutzner
- Department of Information Systems, College of Engineering, Hanyang University, 222 Wangsimni-Ro, Seongdong-Gu, Seoul, 133-791, Republic of Korea.
| |
Collapse
|
3
|
Wang S, Liu Y, Wang J, Zhu X, Shi Y, Wang X, Liu T, Xiao X, Wang J. Is an SV caller compatible with sequencing data? An online recommendation tool to automatically recommend the optimal caller based on data features. Front Genet 2023; 13:1096797. [PMID: 36685885 PMCID: PMC9852890 DOI: 10.3389/fgene.2022.1096797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2022] [Accepted: 12/14/2022] [Indexed: 01/07/2023] Open
Abstract
A lot of bioinformatics tools were released to detect structural variants from the sequencing data during the past decade. For a data analyst, a natural question is about the selection of a tool fits for the data. Thus, this study presents an automatic tool recommendation method to facilitate data analysis. The optimal variant calling tool was recommended from a set of state-of-the-art bioinformatics tools by given a sequencing data. This recommendation method was implemented under a meta-learning framework, identifying the relationships between data features and the performance of tools. First, the meta-features were extracted to characterize the sequencing data and meta-targets were identified to pinpoint the optimal caller for the sequencing data. Second, a meta-model was constructed to bridge the meta-features and meta-targets. Finally, the recommendation was made according to the evaluation from the meta-model. A series of experiments were conducted to validate this recommendation method on both the simulated and real sequencing data. The results revealed that different SV callers often fit different sequencing data. The recommendation accuracy averaged more than 80% across all experimental configurations, outperforming the random- and fixed-pick strategy. To further facilitate the research community, we incorporated the recommendation method into an online cloud services for genomic data analysis, which is available at https://c.solargenomics.com/ via a simple registration. In addition, the source code and a pre-trained model is available at https://github.com/hello-json/CallerRecommendation for academic usages only.
Collapse
Affiliation(s)
- Shenjie Wang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
| | - Yuqian Liu
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
| | - Juan Wang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China,Annoroad Gene Technology (Beijing) Co. Ltd, Beijing, China,*Correspondence: Juan Wang, ; Jiayin Wang,
| | - Xiaoyan Zhu
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
| | - Yuzhi Shi
- Annoroad Gene Technology (Beijing) Co. Ltd, Beijing, China
| | - Xuwen Wang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
| | - Tao Liu
- Annoroad Gene Technology (Beijing) Co. Ltd, Beijing, China
| | - Xiao Xiao
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China,Geneplus Shenzhen, Shenzhen, China
| | - Jiayin Wang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China,Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China,*Correspondence: Juan Wang, ; Jiayin Wang,
| |
Collapse
|
4
|
Cleal K, Baird DM. Dysgu: efficient structural variant calling using short or long reads. Nucleic Acids Res 2022; 50:e53. [PMID: 35100420 PMCID: PMC9122538 DOI: 10.1093/nar/gkac039] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2021] [Revised: 12/20/2021] [Accepted: 01/24/2022] [Indexed: 12/27/2022] Open
Abstract
Structural variation (SV) plays a fundamental role in genome evolution and can underlie inherited or acquired diseases such as cancer. Long-read sequencing technologies have led to improvements in the characterization of structural variants (SVs), although paired-end sequencing offers better scalability. Here, we present dysgu, which calls SVs or indels using paired-end or long reads. Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning. Additional SVs are identified by remapping of anomalous sequences. Dysgu outperforms existing state-of-the-art tools using paired-end or long-reads, offering high sensitivity and precision whilst being among the fastest tools to run. We find that combining low coverage paired-end and long-reads is competitive in terms of performance with long-reads at higher coverage values.
Collapse
Affiliation(s)
- Kez Cleal
- Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| | - Duncan M Baird
- Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| |
Collapse
|
5
|
Kim HS, Blazyte A, Jeon S, Yoon C, Kim Y, Kim C, Bolser D, Ahn JH, Edwards JS, Bhak J. LT1, an ONT long-read-based assembly scaffolded with Hi-C data and polished with short reads. GIGABYTE 2022; 2022:gigabyte51. [PMID: 36824523 PMCID: PMC9650228 DOI: 10.46471/gigabyte.51] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2021] [Accepted: 04/29/2022] [Indexed: 11/09/2022] Open
Abstract
We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly, constructed using 57× nanopore long reads and polished using 47× short paired-end reads. We utilized 72 GB of Hi-C chromosomal mapping data for scaffolding, to maximize assembly contiguity and accuracy. The contig assembly of LT1 was 2.73 Gbp in length, comprising 4490 contigs with an NG50 value of 12.0 Mbp. After scaffolding with Hi-C data and manual curation, the final assembly has an NG50 value of 137 Mbp and 4699 scaffolds. Assessment of gene prediction quality using Benchmarking Universal Single-Copy Orthologs (BUSCO) identified 89.3% of the single-copy orthologous genes included in the benchmark. Detailed characterization of LT1 suggests it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,616 short indels, and 12,079 large structural variants. These data may be used as a benchmark for further in-depth genomic analyses of Baltic populations.
Collapse
Affiliation(s)
- Hui-Su Kim
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea, Corresponding authors. E-mail: ;
| | - Asta Blazyte
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Department of Biomedical Engineering, College of Information and Biotechnology, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Sungwon Jeon
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Department of Biomedical Engineering, College of Information and Biotechnology, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Clinomics LTD, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Changhan Yoon
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Department of Biomedical Engineering, College of Information and Biotechnology, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Yeonkyung Kim
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Clinomics LTD, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Changjae Kim
- Clinomics LTD, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Dan Bolser
- Geromics Ltd., Suite 1, Frohock House, 222 Mill Road, Cambridge, CB1 3NF, UK
| | - Ji-Hye Ahn
- Clinomics LTD, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea
| | - Jeremy S. Edwards
- Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, NM 87131, USA
| | - Jong Bhak
- Korean Genomics Center (KOGIC), Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Department of Biomedical Engineering, College of Information and Biotechnology, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Clinomics LTD, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Republic of Korea,Geromics Ltd., Suite 1, Frohock House, 222 Mill Road, Cambridge, CB1 3NF, UK, Corresponding authors. E-mail: ;
| |
Collapse
|
6
|
Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]
|
7
|
Ahsan MU, Liu Q, Fang L, Wang K. NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks. Genome Biol 2021; 22:261. [PMID: 34488830 PMCID: PMC8419925 DOI: 10.1186/s13059-021-02472-2] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2020] [Accepted: 08/19/2021] [Indexed: 11/24/2022] Open
Abstract
Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, here we present a deep learning method NanoCaller, which detects SNPs using long-range haplotype information, then phases long reads with called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.
Collapse
Affiliation(s)
- Mian Umair Ahsan
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.
| |
Collapse
|
8
|
Guo M, Li S, Zhou Y, Li M, Wen Z. Comparative Analysis for the Performance of Long-Read-Based Structural Variation Detection Pipelines in Tandem Repeat Regions. Front Pharmacol 2021; 12:658072. [PMID: 34163355 PMCID: PMC8215501 DOI: 10.3389/fphar.2021.658072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 05/14/2021] [Indexed: 12/04/2022] Open
Abstract
There has been growing recognition of the vital links between structural variations (SVs) and diverse diseases. Research suggests that, with much longer DNA fragments and abundant contextual information, long-read technologies have advantages in SV detection even in complex repetitive regions. So far, several pipelines for calling SVs from long-read sequencing data have been proposed and used in human genome research. However, the performance of these pipelines is still lack of deep exploration and adequate comparison. In this study, we comprehensively evaluated the performance of three commonly used long-read SV detection pipelines, namely PBSV, Sniffles and PBHoney, especially the performance on detecting the SVs in tandem repeat regions (TRRs). Evaluated by using a robust benchmark for germline SV detection as the gold standard, we thoroughly estimated the precision, recall and F1 score of insertions and deletions detected by the pipelines. Our results revealed that all these pipelines clearly exhibited better performance outside TRRs than that in TRRs. The F1 scores of Sniffles in and outside TRRs were 0.60 and 0.76, respectively. The performance of PBSV was similar to that of Sniffles, and was generally higher than that of PBHoney. In conclusion, our findings can be benefit for choosing the appropriate pipelines in real practice and are good complementary to the application of long-read sequencing technologies in the research of rare diseases.
Collapse
Affiliation(s)
- Mingkun Guo
- College of Chemistry, Sichuan University, Chengdu, China
| | - Shihai Li
- College of Chemistry, Sichuan University, Chengdu, China
| | - Yifan Zhou
- College of Chemistry, Sichuan University, Chengdu, China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu, China
| | - Zhining Wen
- College of Chemistry, Sichuan University, Chengdu, China.,Medical Big Data Center, Sichuan University, Chengdu, China
| |
Collapse
|
9
|
Variant Calling in Next Generation Sequencing Data. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11285-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
|
10
|
Özdemir Özdoğan G, Kaya H. Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data. Interdiscip Sci 2020; 12:302-310. [PMID: 32519123 DOI: 10.1007/s12539-020-00374-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2019] [Revised: 04/26/2020] [Accepted: 05/22/2020] [Indexed: 12/31/2022]
Abstract
Next-generation sequencing (NGS) is related to massively parallel or deep deoxyribonucleic acid (DNA) sequencing technology which has revolutionized genomic researches in recent years. Although the cost of generating NGS data was decreased compared to the one at the time of emerging this technology, its cost might still be somewhat a problem. Hence, new strategies as pool-seq and low-coverage NGS data have been developed to overcome the cost problem. Despite decreasing cost, it is important to elucidate whether they are efficient in NGS studies. We applied a bioinformatics pipeline on pool-seq and low-coverage retinoblastoma data retrieved from only tumor data. Retinoblastoma is an eye malignancy in childhood that is initiated by RB1 mutation or MYCN amplification and can lead to the loss of vision of eye(s), and even sometimes life. We applied our pipeline on both retinoblastoma disease data and two other particular data to testify the validity and also for comparison purposes in the aspect of performance. High-confidence variant calls from Genome in a Bottle Consortium were used for fulfilling these purposes. We observed that our pipeline successfully called higher number of variants than a standard pipeline for all these three different data. Besides, the recall and F-score values were quite better in our pipeline as being noteworthy. We further presented our results on disease data in the aspects of the variants, variant types and disease-related genes. This study provides a guideline for performing NGS data analysis pipeline on pool-seq and low-coverage sequencing data in conjunction. To get more conclusive outcomes of these two strategies, we recommend using cancer data having higher mutation rates and larger pools.
Collapse
Affiliation(s)
| | - Hilal Kaya
- Department of Computer Engineering, Ankara Yildirim Beyazit University, 06010, Ankara, Turkey.
| |
Collapse
|
11
|
Abstract
Genomic sequencing technologies have revolutionized mutation detection of the genetic diseases in the past few years. In recent years, the third generation sequencing (TGS) has been gaining insight into more genetic diseases owing to the single molecular and real time sequencing technology. This paper reviews the genomic sequencing revolutionary history first and then focuses on the genetic diseases discovered through the TGS and the clinical effects of the TGS, which is followed by the discussion of the improvement in the bioinformatic analysis for the TGS and its limitations. In summary, the TGS has been enhancing the diagnostic accuracy of genetic diseases in molecular level as well as paving a new way for basic researches and therapies.
Collapse
Affiliation(s)
- Tiantian Xiao
- Clinic of Neonatology, Children's Hospital of Fudan University, Shanghai 201102, China.,Department of Neonatology, Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wenhao Zhou
- Clinic of Neonatology, Children's Hospital of Fudan University, Shanghai 201102, China.,Key Laboratory of Birth Defects, Children's Hospital of Fudan University, Shanghai 201102, China.,Key Laboratory of Neonatal Diseases, Children's Hospital of Fudan University, Shanghai 201102, China
| |
Collapse
|
12
|
Abstract
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
Collapse
Affiliation(s)
- Steve S Ho
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| |
Collapse
|
13
|
Kuzniar A, Maassen J, Verhoeven S, Santuari L, Shneider C, Kloosterman WP, de Ridder J. sv-callers: a highly portable parallel workflow for structural variant detection in whole-genome sequence data. PeerJ 2020; 8:e8214. [PMID: 31934500 PMCID: PMC6951283 DOI: 10.7717/peerj.8214] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2019] [Accepted: 11/14/2019] [Indexed: 12/19/2022] Open
Abstract
Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases including cancer. Despite the advances in whole genome sequencing, comprehensive and accurate detection of SVs in short-read data still poses some practical and computational challenges. We present sv-callers, a highly portable workflow that enables parallel execution of multiple SV detection tools, as well as provide users with example analyses of detected SV callsets in a Jupyter Notebook. This workflow supports easy deployment of software dependencies, configuration and addition of new analysis tools. Moreover, porting it to different computing systems requires minimal effort. Finally, we demonstrate the utility of the workflow by performing both somatic and germline SV analyses on different high-performance computing systems.
Collapse
Affiliation(s)
| | | | | | - Luca Santuari
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Carl Shneider
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Wigard P Kloosterman
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Jeroen de Ridder
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| |
Collapse
|
14
|
Yokoyama TT, Kasahara M. Visualization tools for human structural variations identified by whole-genome sequencing. J Hum Genet 2020; 65:49-60. [PMID: 31666648 PMCID: PMC8075883 DOI: 10.1038/s10038-019-0687-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 09/27/2019] [Accepted: 10/02/2019] [Indexed: 01/02/2023]
Abstract
Visualizing structural variations (SVs) is a critical step for finding associations between SVs and human traits or diseases. Given that there are many sequencing platforms used for SV identification and given that how best to visualize SVs together with other data, such as read alignments and annotations, depends on research goals, there are dozens of SV visualization tools designed for different research goals and sequencing platforms. Here, we provide a comprehensive survey of over 30 SV visualization tools to help users choose which tools to use. This review targets users who wish to visualize a set of SVs identified from the massively parallel sequencing reads of an individual human genome. We first categorize the ways in which SV visualization tools display SVs into ten major categories, which we denote as view modules. View modules allow readers to understand the features of each SV visualization tool quickly. Next, we introduce the features of individual SV visualization tools from several aspects, including whether SV views are integrated with annotations, whether long-read alignment is displayed, whether underlying data structures are graph-based, the type of SVs shown, whether auditing is possible, whether bird's eye view is available, sequencing platforms, and the number of samples. We hope that this review will serve as a guide for readers on the currently available SV visualization tools and lead to the development of new SV visualization tools in the near future.
Collapse
Affiliation(s)
- Toshiyuki T Yokoyama
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
| | - Masahiro Kasahara
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan.
| |
Collapse
|
15
|
Zhu S, Emrich SJ, Chen DZ. Predicting Local Inversions Using Rectangle Clustering and Representative Rectangle Prediction. IEEE Trans Nanobioscience 2019; 18:316-323. [PMID: 31180865 PMCID: PMC6606370 DOI: 10.1109/tnb.2019.2915060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
Abstract
As a specific type of structural variation, inversions are enjoying particular traction as a result of their established role in evolution. Using third-generation sequencing technology to predict inversions is growing in interest, but many such methods focus on improving sensitivity, giving rise to either too many false positives or very long running times. In this paper, we propose a new framework for inversion detection based on a combination of two novel theoretical models: rectangle clustering and representative rectangle prediction. This combination can automatically filter out false positive inversion predictions while retaining correct ones, leading to a method that has both high sensitivity and high positive prediction values (PPV). Further, this new framework can run very fast on available data. Our software can be freely obtained at https://github.com/UTbioinf/RigInv.
Collapse
|
16
|
Whitford W, Lehnert K, Snell RG, Jacobsen JC. Evaluation of the performance of copy number variant prediction tools for the detection of deletions from whole genome sequencing data. J Biomed Inform 2019; 94:103174. [PMID: 30965134 DOI: 10.1016/j.jbi.2019.103174] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2018] [Revised: 03/12/2019] [Accepted: 04/06/2019] [Indexed: 12/30/2022]
Abstract
BACKGROUND Whole genome sequencing (WGS) has increased in popularity and decreased in cost over the past decade, rendering this approach as a viable and sensitive method for variant detection. In addition to its utility for single nucleotide variant detection, WGS data has the potential to detect Copy Number Variants (CNV) to fine resolution. Many CNV detection software packages have been developed exploiting four main types of data: read pair, split read, read depth, and assembly based methods. The aim of this study was to evaluate the efficiency of each of these main approaches in detecting germline deletions. METHODS WGS data and high confidence deletion calls for the individual NA12878 from the Genome in a Bottle consortium were the benchmark dataset. The performance of BreakDancer, CNVnator, Delly, FermiKit, and Pindel was assessed by comparing the accuracy and sensitivity of each software package in detecting deletions exceeding 1 kb. RESULTS There was considerable variability in the outputs of the different WGS CNV detection programs. The best performance was seen from BreakDancer and Delly, with 92.6% and 96.7% sensitivity, respectively and 34.5% and 68.5% false discovery rate (FDR), respectively. In comparison, Pindel, CNVnator, and FermiKit were less effective with sensitivities of 69.1%, 66.0%, and 15.8%, respectively and FDR of 91.3%, 69.0%, and 31.7%, respectively. Concordance across software packages was poor, with only 27 of the total 612 benchmark deletions identified by all five methodologies. CONCLUSIONS The WGS based CNV detection tools evaluated show disparate performance in identifying deletions ≥1 kb, particularly those utilising different input data characteristics. Software that exploits read pair based data had the highest sensitivity, namely BreakDancer and Delly. BreakDancer also had the second lowest false discovery rate. Therefore, in this analysis read pair methods (BreakDancer in particular) were the best performing approaches for the identification of deletions ≥1 kb, balancing accuracy and sensitivity. There is potential for improvement in the detection algorithms, particularly for reducing FDR. This analysis has validated the utility of WGS based CNV detection software to reliably identify deletions, and these findings will be of use when choosing appropriate software for deletion detection, in both research and diagnostic medicine.
Collapse
Affiliation(s)
- Whitney Whitford
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand; Centre for Brain Research, The University of Auckland, New Zealand.
| | - Klaus Lehnert
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand; Centre for Brain Research, The University of Auckland, New Zealand.
| | - Russell G Snell
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand; Centre for Brain Research, The University of Auckland, New Zealand.
| | - Jessie C Jacobsen
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand; Centre for Brain Research, The University of Auckland, New Zealand.
| |
Collapse
|
17
|
Jiang T, Fu Y, Liu B, Wang Y. Long-Read Based Novel Sequence Insertion Detection With rCANID. IEEE Trans Nanobioscience 2019; 18:343-352. [PMID: 30946672 DOI: 10.1109/tnb.2019.2908438] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Novel sequence insertion (NSI) is an essential category of genome structural variations (SVs), which represents DNA segments absent from the reference genome assembly. It has important biological functions and strong correlation with phenotypes and diseases. The rapid development of long-read sequencing technologies provides the opportunities to discover NSIs more sensitively, since the much longer reads are helpful for the assembly and location of the novel sequences. However, most of state-of-the-art long-read based SV detection approaches are in generic design to detect various kinds of SVs, and they are either not suited to detect NSIs or computationally expensive. Herein, we propose read clustering and assembly-based novel insertion detection tool (rCANID). It applies tailored chimerically aligned and unaligned read clustering and lightweight local assembly methods to reconstruct inserted sequences with low computational cost. Benchmarks on both simulated and real datasets demonstrate that rCANID can discover NSIs sensitively and efficiently, especially for NSI events with long inserted sequences which is still a non-trivial task for state-of-the-art approaches. With its good NSI detection ability, rCANID is suited to be integrated into computational pipelines to play important roles in many cutting-edge genomics studies.
Collapse
|
18
|
Zhang R, Billingsley MM, Mitchell MJ. Biomaterials for vaccine-based cancer immunotherapy. J Control Release 2018; 292:256-276. [PMID: 30312721 PMCID: PMC6355332 DOI: 10.1016/j.jconrel.2018.10.008] [Citation(s) in RCA: 114] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2018] [Revised: 10/06/2018] [Accepted: 10/08/2018] [Indexed: 12/28/2022]
Abstract
The development of therapeutic cancer vaccines as a means to generate immune reactivity against tumors has been explored since the early discovery of tumor-specific antigens by Georg Klein in the 1960s. However, challenges including weak immunogenicity, systemic toxicity, and off-target effects of cancer vaccines remain as barriers to their broad clinical translation. Advances in the design and implementation of biomaterials are now enabling enhanced efficacy and reduced toxicity of cancer vaccines by controlling the presentation and release of vaccine components to immune cells and their microenvironment. Here, we discuss the rational design and clinical status of several classes of cancer vaccines (including DNA, mRNA, peptide/protein, and cell-based vaccines) along with novel biomaterial-based delivery technologies that improve their safety and efficacy. Further, strategies for designing new platforms for personalized cancer vaccines are also considered.
Collapse
Affiliation(s)
- Rui Zhang
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Margaret M Billingsley
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Michael J Mitchell
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States; Abramson Cancer Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States.
| |
Collapse
|
19
|
Detecting a long insertion variant in SAMD12 by SMRT sequencing: implications of long-read whole-genome sequencing for repeat expansion diseases. J Hum Genet 2018; 64:191-197. [DOI: 10.1038/s10038-018-0551-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2018] [Revised: 11/12/2018] [Accepted: 11/27/2018] [Indexed: 01/21/2023]
|