1. Dong L, Zhang Y, Fu B, Swart C, Jiang H, Liu Y, Huggett J, Wielgosz R, Niu C, Li Q, Zhang Y, Park SR, Sui Z, Yu L, Liu Y, Xie Q, Zhang H, Yang Y, Dai X, Shi L, Yin Y, Fang X. Reliable biological and multi-omics research through biometrology. Anal Bioanal Chem 2024; 416:3645-3663. [PMID: 38507042 DOI: 10.1007/s00216-024-05239-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 02/28/2024] [Accepted: 03/04/2024] [Indexed: 03/22/2024]
Abstract
Metrology is the science of measurement and its applications, whereas biometrology is the science of biological measurement and its applications. Biometrology aims to achieve accuracy and consistency of biological measurements by focusing on the development of metrological traceability, biological reference measurement procedures, and reference materials. Irreproducibility of biological and multi-omics research results from different laboratories, platforms, and analysis methods is hampering the translation of research into clinical uses and can often be attributed to biologists' lack of attention to the general principles of metrology. This paper reviews progress in biometrology, including metrology for nucleic acid, protein, and cell measurements, and its impact on the reliability and comparability of biological research. It also discusses the challenges biometrology faces in obtaining more reliable biological and multi-omics measurements, which stem from the lack of primary reference measurement procedures and of new standards for biological reference materials. In the future, in addition to establishing reliable reference measurement procedures, emphasis should be placed on developing reference materials that range from single or multiple parameters to the multi-omics scale. Thinking in terms of biometrology is warranted for facilitating the translation of high-throughput omics research into clinical practice.
Affiliation(s)
- Lianhua Dong
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Yu Zhang
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Boqiang Fu
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Claudia Swart
  - Physikalisch-Technische Bundesanstalt, 38116, Braunschweig, Germany
- Yahui Liu
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Jim Huggett
  - National Measurement Laboratory at LGC (NML), Teddington, Middlesex, UK
- Robert Wielgosz
  - Bureau International Des Poids Et Mesures (BIPM), Pavillon de Breteuil, 92312, Sèvres Cedex, France
- Chunyan Niu
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Qianyi Li
  - BGI, BGI-Shenzhen, Shenzhen, 518083, China
- Yongzhuo Zhang
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Sang-Ryoul Park
  - Korea Research Institute of Standards and Science, Daejeon, Republic of Korea
- Zhiwei Sui
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Lianchao Yu
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Qing Xie
  - BGI, BGI-Shenzhen, Shenzhen, 518083, China
- Hongfu Zhang
  - BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
- Xinhua Dai
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
- Leming Shi
  - State Key Laboratory of Genetic Engineering, School of Life Sciences and Human Phenome Institute, Fudan University, Shanghai, 200438, China
- Ye Yin
  - BGI, BGI-Shenzhen, Shenzhen, 518083, China
- Xiang Fang
  - Center for Advanced Measurement of Science, National Institute of Metrology, Beijing, 100029, China
2. Charron P, Kang M. VariantDetective: an accurate all-in-one pipeline for detecting consensus bacterial SNPs and SVs. Bioinformatics 2024; 40:btae066. [PMID: 38366603 PMCID: PMC10898327 DOI: 10.1093/bioinformatics/btae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 01/16/2024] [Accepted: 02/14/2024] [Indexed: 02/18/2024] Open
Abstract
MOTIVATION: Genomic variations comprise a spectrum of alterations, ranging from single nucleotide polymorphisms (SNPs) to large-scale structural variants (SVs), which play crucial roles in bacterial evolution and species diversification. Accurately identifying SNPs and SVs is beneficial for subsequent evolutionary and epidemiological studies. This study presents VariantDetective (VD), a novel, user-friendly, and all-in-one pipeline combining SNP and SV calling to generate consensus genomic variants using multiple tools.
RESULTS: The VD pipeline accepts various file types as input to initiate SNP and/or SV calling, and benchmarking results demonstrate VD's robustness and high accuracy across multiple tested datasets when compared to existing variant calling approaches.
AVAILABILITY AND IMPLEMENTATION: The source code, test data, and relevant information for VD are freely accessible at https://github.com/OLF-Bioinformatics/VariantDetective under the MIT License.
Affiliation(s)
- Philippe Charron
  - Ottawa Laboratory-Fallowfield, Canadian Food Inspection Agency, 3851 Fallowfield Road, Nepean, Ontario K2J 4S1, Canada
- Mingsong Kang
  - Ottawa Laboratory-Fallowfield, Canadian Food Inspection Agency, 3851 Fallowfield Road, Nepean, Ontario K2J 4S1, Canada
3. Meng X, Wang M, Luo M, Sun L, Yan Q, Liu Y. Systematic evaluation of multiple NGS platforms for structural variants detection. J Biol Chem 2023; 299:105436. [PMID: 37944616 PMCID: PMC10724692 DOI: 10.1016/j.jbc.2023.105436] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/28/2023] [Revised: 10/29/2023] [Accepted: 10/31/2023] [Indexed: 11/12/2023] Open
Abstract
Structural variations (SVs) are critical genome changes affecting human diseases. Although many hybridization-based methods exist, evaluating SVs through next-generation sequencing (NGS) data is still necessary for broader research exploration. Here, we comprehensively compared the performance of 16 SV callers and multiple NGS platforms using NA12878 whole genome sequencing (WGS) datasets. The results indicated that several SV callers performed relatively well, such as Manta, GRIDSS, LUMPY, TARDIS, FermiKit, and Wham. Meanwhile, all NGS platforms performed similarly when analyzed with the same software. Additionally, we found that the undetected SVs mostly came from long-read datasets; therefore, a more appropriate strategy for accurate SV detection in the future will be to integrate long and short reads. With NGS currently a mainstream method in bioinformatics, our study provides helpful and comprehensive guidelines for specific categories of SV research.
Affiliation(s)
- Xuan Meng
  - School of Medicine, Southern University of Science and Technology, Shenzhen, China
- Miao Wang
  - Research Cooperation Department, GeneMind Biosciences Company Limited, Shenzhen, China
- Mingjie Luo
  - Research Cooperation Department, GeneMind Biosciences Company Limited, Shenzhen, China
- Lei Sun
  - Research Cooperation Department, GeneMind Biosciences Company Limited, Shenzhen, China
- Qin Yan
  - Research Cooperation Department, GeneMind Biosciences Company Limited, Shenzhen, China
- Yongfeng Liu
  - Research Cooperation Department, GeneMind Biosciences Company Limited, Shenzhen, China
4. Wei ZG, Bu PY, Zhang XD, Liu F, Qian Y, Wu FX. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics 2023; 39:btad726. [PMID: 38058196 DOI: 10.1093/bioinformatics/btad726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 11/02/2023] [Accepted: 12/05/2023] [Indexed: 12/08/2023]
Abstract
MOTIVATION: Longer reads produced by PacBio or Oxford Nanopore sequencers could more frequently span the breakpoints of structural variations (SVs) than shorter reads. Therefore, existing long-read mapping methods often generate wrong alignments and variant calls. Compared to deletions and insertions, inversion events are more difficult to detect since the anchors in inversion regions are nonlinear to those in SV-free regions. To address this issue, this study presents a novel long-read mapping algorithm (named invMap).
RESULTS: For each long noisy read, invMap first locates the aligned region with a specifically designed scoring method for chaining, then checks the remaining anchors in the aligned region to discover potential inversions. We benchmark invMap on simulated datasets across different genomes and sequencing coverages; experimental results demonstrate that invMap locates aligned regions and calls inversion SVs more accurately than the competing methods. The real human genome sequencing dataset of NA12878 illustrates that invMap can effectively find more candidate inversion calls than the competing methods.
AVAILABILITY AND IMPLEMENTATION: The invMap software is available at https://github.com/zhang134/invMap.git.
Affiliation(s)
- Ze-Gang Wei
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Peng-Yu Bu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Xiao-Dan Zhang
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fei Liu
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Yu Qian
  - School of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji 721016, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
5. Bezdvornykh I, Cherkasov N, Kanapin A, Samsonova A. A collection of read depth profiles at structural variant breakpoints. Sci Data 2023; 10:186. [PMID: 37024526 PMCID: PMC10079824 DOI: 10.1038/s41597-023-02076-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2022] [Accepted: 03/16/2023] [Indexed: 04/08/2023] Open
Abstract
SWaveform, a newly created open genome-wide resource for read depth signal in the vicinity of structural variant (SV) breakpoints, aims to boost development of computational tools and algorithms for discovery of genomic rearrangement events from sequencing data. SVs are a dominant force shaping genomes and substantially contributing to genetic diversity. Still, there are challenges in reliable and efficient genotyping of SVs from whole genome sequencing data, thus delaying translation into clinical applications and wasting valuable resources. SWaveform includes a database containing ~7 M of read depth profiles at SV breakpoints extracted from 911 sequencing samples generated by the Human Genome Diversity Project, generalised patterns of the signal at breakpoints, an interface for navigation and download, as well as a toolbox for local deployment with user's data. The dataset can be of immense value to bioinformatics and engineering communities as it empowers smooth application of intelligent signal processing and machine learning techniques for discovery of genomic rearrangement events and thus opens the floodgates for development of innovative algorithms and software.
Affiliation(s)
- Igor Bezdvornykh
  - Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, 199004, Russia
- Nikolay Cherkasov
  - Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, 199004, Russia
- Alexander Kanapin
  - Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, 199004, Russia
- Anastasia Samsonova
  - Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, 199004, Russia
6. Kim H, Shim Y, Lee TG, Won D, Choi JR, Shin S, Lee ST. Copy-number analysis by base-level normalization: An intuitive visualization tool for evaluating copy number variations. Clin Genet 2023; 103:35-44. [PMID: 36152294 DOI: 10.1111/cge.14236] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/08/2022] [Revised: 09/19/2022] [Accepted: 09/20/2022] [Indexed: 12/13/2022]
Abstract
Next-generation sequencing (NGS) facilitates comprehensive molecular analyses that help with diagnosing unsolved disorders. In addition to detecting single-nucleotide variations and small insertions/deletions, bioinformatics tools can identify copy number variations (CNVs) in NGS data, which improves the diagnostic yield. However, due to the possibility of false positives, subsequent confirmation tests are generally performed. Here, we introduce Copy-number Analysis by BAse-level NormAlization (CABANA), a visualization tool that allows users to intuitively identify candidate CNVs using the normalized single-base-level read depth calculated from NGS data. To demonstrate how CABANA works, NGS data were obtained from 474 patients with neuromuscular disorders. CNVs were screened using a conventional bioinformatics tool, ExomeDepth, and then we normalized and visualized those data at the single-base level using CABANA, followed by manual inspection by geneticists to filter out false positives and determine candidate CNVs. In doing so, we identified 31 candidate CNVs (7%) in 474 patients and subsequently confirmed all of them to be true using multiplex ligation-dependent probe amplification. The performance of CABANA was deemed acceptable by comparing its diagnostic yield with previous data about neuromuscular disorders. Despite some limitations, we expect CABANA to help researchers accurately identify CNVs and reduce the need for subsequent confirmation testing.
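The idea of a normalized single-base-level read depth can be illustrated with a minimal sketch. This is not the CABANA implementation: the function names, the choice of the median as the normalizer, and the use of control samples as a per-base reference are all assumptions made for illustration. Under these assumptions, a ratio near 1.0 suggests two copies and a ratio near 0.5 suggests a heterozygous deletion.

```python
from statistics import median
from typing import List


def normalize_depth(depths: List[float]) -> List[float]:
    """Scale per-base depths by the sample's median depth (assumed normalizer)."""
    centre = median(depths)
    if centre <= 0:
        raise ValueError("median depth must be positive")
    return [d / centre for d in depths]


def copy_ratio(sample: List[float], controls: List[List[float]]) -> List[float]:
    """Per-base ratio of the sample's normalized depth to the control median.

    `sample` and each list in `controls` hold read depths for the same bases,
    in the same order (a hypothetical input layout).
    """
    sample_norm = normalize_depth(sample)
    controls_norm = [normalize_depth(c) for c in controls]
    ratios = []
    for i, value in enumerate(sample_norm):
        reference = median(c[i] for c in controls_norm)
        ratios.append(value / reference if reference > 0 else float("nan"))
    return ratios


# Toy data: bases 3 and 4 of the sample have half the expected depth.
sample_depths = [100, 100, 50, 50, 100, 100]
control_depths = [[100] * 6, [200] * 6]
ratios = copy_ratio(sample_depths, control_depths)
```

Plotting such ratios along a gene is one way a reviewer could inspect a candidate CNV by eye; any real threshold for calling a deletion or duplication would need to be validated on known samples.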
Affiliation(s)
- Hongkyung Kim
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Yeeun Shim
  - Brain Korea 21 PLUS Project for Medical Science, Yonsei University, Seoul, Republic of Korea
- Taek Gyu Lee
  - Brain Korea 21 PLUS Project for Medical Science, Yonsei University, Seoul, Republic of Korea
- Dongju Won
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Jong Rak Choi
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
  - Dxome Co. Ltd, Seongnam-si, Gyeonggi-do, Republic of Korea
- Saeam Shin
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
- Seung-Tae Lee
  - Department of Laboratory Medicine, Yonsei University College of Medicine, Severance Hospital, Seoul, Republic of Korea
  - Dxome Co. Ltd, Seongnam-si, Gyeonggi-do, Republic of Korea
7. Kaplun L, Krautz-Peterson G, Neerman N, Stanley C, Hussey S, Folwick M, McGarry A, Weiss S, Kaplun A. ONT long-read WGS for variant discovery and orthogonal confirmation of short read WGS derived genetic variants in clinical genetic testing. Front Genet 2023; 14:1145285. [PMID: 37152986 PMCID: PMC10160624 DOI: 10.3389/fgene.2023.1145285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 04/05/2023] [Indexed: 05/09/2023] Open
Abstract
Technological advances in Next-Generation Sequencing dramatically increased clinical efficiency of genetic testing, allowing detection of a wide variety of variants, from single nucleotide events to large structural aberrations. Whole Genome Sequencing (WGS) has allowed exploration of areas of the genome that might not have been targeted by other approaches, such as intergenic regions. A single technique detecting all genetic variants at once is intended to expedite the diagnostic process while making it more comprehensive and efficient. Nevertheless, there are still several shortcomings that cannot be effectively addressed by short read sequencing, such as determination of the precise size of short tandem repeat (STR) expansions, phasing of potentially compound recessive variants, resolution of some structural variants and exact determination of their boundaries, etc. Therefore, in some cases variants can only be tentatively detected by short read sequencing and require orthogonal confirmation, particularly for clinical reporting purposes. Moreover, certain regulatory authorities, for example, New York state CLIA, require orthogonal confirmation of every reportable variant. Such orthogonal confirmations often involve numerous different techniques, not necessarily available in the same laboratory and not always performed in an expedited manner, thus negating the advantages of the "one-technique-for-all" approach, and making the process lengthy, prone to logistical and analytical faults, and financially inefficient. Fortunately, those weak spots of short read sequencing can be compensated for by long read technology, which has comparable or better detection of some types of variants while lacking the above-mentioned limitations of short read sequencing.
At Variantyx we have developed an integrated clinical genetic testing approach, augmenting short read WGS-based variant detection with Oxford Nanopore Technologies (ONT) long read sequencing, providing simultaneous orthogonal confirmation of all types of variants with the additional benefit of improved identification of exact size and position of the detected aberrations. The validation study of this augmented test has demonstrated that Oxford Nanopore Technologies sequencing can efficiently verify multiple types of reportable variants, thus ensuring highly reliable detection and a quick turnaround time for WGS-based clinical genetic testing.
8. Schikora-Tamarit MÀ, Gabaldón T. PerSVade: personalized structural variant detection in any species of interest. Genome Biol 2022; 23:175. [PMID: 35974382 PMCID: PMC9380391 DOI: 10.1186/s13059-022-02737-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/29/2021] [Accepted: 07/22/2022] [Indexed: 11/12/2022] Open
Abstract
Structural variants (SVs) underlie genomic variation but are often overlooked due to difficult detection from short reads. Most algorithms have been tested on humans, and it remains unclear how applicable they are in other organisms. To solve this, we develop perSVade (personalized structural variation detection), a sample-tailored pipeline that provides optimally called SVs and their inferred accuracy, as well as small and copy number variants. PerSVade increases SV calling accuracy on a benchmark of six eukaryotes. We find no universal set of optimal parameters, underscoring the need for sample-specific parameter optimization. PerSVade will facilitate SV detection and study across diverse organisms.
Affiliation(s)
- Miquel Àngel Schikora-Tamarit
  - Barcelona Supercomputing Centre (BSC-CNS), Plaça Eusebi Güell, 1-3, 08034, Barcelona, Spain
  - Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028, Barcelona, Spain
- Toni Gabaldón
  - Barcelona Supercomputing Centre (BSC-CNS), Plaça Eusebi Güell, 1-3, 08034, Barcelona, Spain
  - Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, 08028, Barcelona, Spain
  - Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
  - Centro Investigación Biomédica En Red de Enfermedades Infecciosas, Barcelona, Spain
9. Sarwal V, Niehus S, Ayyala R, Kim M, Sarkar A, Chang S, Lu A, Rajkumar N, Darfci-Maher N, Littman R, Chhugani K, Soylev A, Comarova Z, Wesel E, Castellanos J, Chikka R, Distler MG, Eskin E, Flint J, Mangul S. A comprehensive benchmarking of WGS-based deletion structural variant callers. Brief Bioinform 2022; 23:6618239. [PMID: 35753701 DOI: 10.1093/bib/bbac221] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/08/2021] [Revised: 04/30/2022] [Accepted: 05/11/2022] [Indexed: 01/10/2023] Open
Abstract
Advances in whole-genome sequencing (WGS) promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges, and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both precision and sensitivity rates of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, because SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
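Reporting both precision and sensitivity requires a complete truth set, because false positives can only be counted when every true deletion is known. The sketch below shows one way such rates could be computed. It is an illustration only: the matching rule (at least 50% reciprocal overlap, each true deletion matched at most once), the function names, and the tuple layout are assumptions, not the criteria used in the benchmarking study.

```python
from typing import List, Tuple

# A deletion as (chromosome, start, end), half-open coordinates (assumed layout).
Deletion = Tuple[str, int, int]


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Smaller of the two overlap fractions, or 0.0 if the intervals are disjoint."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def precision_sensitivity(
    calls: List[Deletion], truth: List[Deletion], min_overlap: float = 0.5
) -> Tuple[float, float]:
    """Greedily match calls to unmatched true deletions and return both rates."""
    matched = set()
    true_positives = 0
    for call in calls:
        for index, true_deletion in enumerate(truth):
            if index in matched:
                continue
            if reciprocal_overlap(call, true_deletion) >= min_overlap:
                matched.add(index)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Toy data: one of two calls matches one of three true deletions.
truth_set = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 50, 150)]
call_set = [("chr1", 110, 210), ("chr1", 900, 1000)]
precision, sensitivity = precision_sensitivity(call_set, truth_set)
```

In this toy case precision is 1/2 and sensitivity is 1/3. A stricter or looser overlap threshold would change which calls count as true positives, so the threshold should always be reported alongside the rates.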
Affiliation(s)
- Varuni Sarwal
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
- Sebastian Niehus
  - Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany
  - Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Charitéplatz 1, 10117 Berlin, Germany
- Ram Ayyala
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Minyoung Kim
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089
- Aditya Sarkar
  - School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175001, India
- Sei Chang
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Angela Lu
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Neha Rajkumar
  - Department of Bioengineering, University of California Los Angeles, Los Angeles, CA, 90095
- Nicholas Darfci-Maher
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Russell Littman
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Karishma Chhugani
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California 1985 Zonal Avenue Los Angeles, CA 90089-9121
- Arda Soylev
  - Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey
- Zoia Comarova
  - Department of Civil and Environmental Engineering, University of Southern California, Los Angeles, CA, United States
- Emily Wesel
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Jacqueline Castellanos
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Rahul Chikka
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Margaret G Distler
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Drive South, Box 708822, Los Angeles, CA, 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, 73-235 CHS, Los Angeles, CA, 90095, USA
- Jonathan Flint
  - Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, 760 Westwood Plaza, Los Angeles, CA 90095, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California 1985 Zonal Avenue Los Angeles, CA 90089-9121
10. Cleal K, Baird DM. Dysgu: efficient structural variant calling using short or long reads. Nucleic Acids Res 2022; 50:e53. [PMID: 35100420 PMCID: PMC9122538 DOI: 10.1093/nar/gkac039] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/03/2021] [Revised: 12/20/2021] [Accepted: 01/24/2022] [Indexed: 12/27/2022] Open
Abstract
Structural variation (SV) plays a fundamental role in genome evolution and can underlie inherited or acquired diseases such as cancer. Long-read sequencing technologies have led to improvements in the characterization of structural variants (SVs), although paired-end sequencing offers better scalability. Here, we present dysgu, which calls SVs or indels using paired-end or long reads. Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning. Additional SVs are identified by remapping of anomalous sequences. Dysgu outperforms existing state-of-the-art tools using paired-end or long-reads, offering high sensitivity and precision whilst being among the fastest tools to run. We find that combining low coverage paired-end and long-reads is competitive in terms of performance with long-reads at higher coverage values.
Affiliation(s)
- Kez Cleal
  - Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
- Duncan M Baird
  - Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
11. Yang J, Chaisson MJP. TT-Mars: structural variants assessment based on haplotype-resolved assemblies. Genome Biol 2022; 23:110. [PMID: 35524317 PMCID: PMC9077962 DOI: 10.1186/s13059-022-02666-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/29/2021] [Accepted: 03/30/2022] [Indexed: 01/30/2023] Open
Abstract
Variant benchmarking is often performed by comparing a test callset to a gold standard set of variants. In repetitive regions of the genome, it may be difficult to establish what is the truth for a call, for example, when different alignment scoring metrics provide equally supported but different variant calls on the same data. Here, we provide an alternative approach, TT-Mars, that takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by providing false discovery rates for variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.
Affiliation(s)
- Jianzhi Yang
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Mark J P Chaisson
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
12. Gordeeva V, Sharova E, Arapidi G. Progress in Methods for Copy Number Variation Profiling. Int J Mol Sci 2022; 23:ijms23042143. [PMID: 35216262 PMCID: PMC8879278 DOI: 10.3390/ijms23042143] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/17/2022] [Revised: 02/09/2022] [Accepted: 02/11/2022] [Indexed: 02/04/2023] Open
Abstract
Copy number variations (CNVs) are the predominant class of structural genomic variations involved in the processes of evolutionary adaptation, genomic disorders, and disease progression. Compared with single-nucleotide variants, there have been challenges associated with the detection of CNVs owing to their diverse sizes. However, the field has seen significant progress in the past 20–30 years. This has been made possible due to the rapid development of molecular diagnostic methods which ensure a more detailed view of the genome structure, further complemented by recent advances in computational methods. Here, we review the major approaches that have been used to routinely detect CNVs, ranging from cytogenetics to the latest sequencing technologies, and then cover their specific features.
Affiliation(s)
- Veronika Gordeeva
- Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia
- Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
- Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
- Correspondence:
| | - Elena Sharova
- Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
| | - Georgij Arapidi
- Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, 119435 Moscow, Russia; (E.S.); (G.A.)
- Moscow Institute of Physics and Technology, National Research University, Moscow Oblast, 141701 Moscow, Russia
- Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 117997 Moscow, Russia
|
13
|
Combining callers improves the detection of copy number variants from whole-genome sequencing. Eur J Hum Genet 2022; 30:178-186. [PMID: 34744167 PMCID: PMC8821561 DOI: 10.1038/s41431-021-00983-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 11/18/2020] [Revised: 09/23/2021] [Accepted: 10/04/2021] [Indexed: 01/03/2023] Open
Abstract
Copy number variants (CNVs) are deletions, duplications or insertions larger than 50 base pairs. They account for a large percentage of normal genome variation and play major roles in human pathology. While array-based approaches have long been used to detect them in clinical practice, whole-genome sequencing (WGS) promises to allow concomitant exploration of CNVs and smaller variants. However, accurately calling CNVs from WGS remains a difficult computational task, for which a consensus is still lacking. In this paper, we explore practical calling options to reach the best compromise between sensitivity and specificity. We show that callers based on different signals (paired-end reads, split reads, coverage depth) yield complementary results. We suggest approaches combining four selected callers (Manta, Delly, ERDS, CNVnator) and a regenotyping tool (SV2), and show that this is applicable in everyday practice in terms of computation time and further interpretation. We demonstrate the superiority of these approaches over array-based Comparative Genomic Hybridization (aCGH), specifically regarding breakpoint resolution and the detection of potentially relevant CNVs. Finally, we confirm our results on the NA12878 benchmark genome, as well as one clinically validated sample. In conclusion, we suggest that WGS constitutes a timely and economically valid alternative to the combination of aCGH and whole-exome sequencing.
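The idea of combining callers that rely on different signals can be illustrated with a small consensus filter. The abstract does not state the combination rules used in the study, so the sketch below assumes a common convention: a call is kept when calls from at least two distinct callers agree on chromosome and type and share a 50% reciprocal overlap. The `Call` record, the function names, and both thresholds are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One CNV call (hypothetical record layout, not any caller's output format)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    kind: str    # "DEL" or "DUP"
    caller: str  # label of the tool that produced the call


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if incompatible."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def supported_calls(calls, min_overlap=0.5, min_callers=2):
    """Keep calls backed by at least `min_callers` distinct callers.

    Both thresholds are assumptions made for this sketch.
    """
    kept = []
    for call in calls:
        # A call always overlaps itself, so its own caller counts once.
        supporters = {
            other.caller
            for other in calls
            if reciprocal_overlap(call, other) >= min_overlap
        }
        if len(supporters) >= min_callers:
            kept.append(call)
    return kept


calls = [
    Call("chr1", 1000, 2000, "DEL", "manta"),
    Call("chr1", 1100, 2100, "DEL", "cnvnator"),
    Call("chr2", 500, 900, "DUP", "delly"),
]
consensus = supported_calls(calls)
print([(c.chrom, c.start, c.end, c.caller) for c in consensus])
```

In this toy input the two chr1 deletions support each other, while the chr2 duplication is seen by a single caller and is dropped. A real pipeline would also merge the supporting calls into one record and regenotype it, which this sketch does not attempt.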
|
14
|
Comprehensive characterization of copy number variation (CNV) called from array, long- and short-read data. BMC Genomics 2021; 22:826. [PMID: 34789167 PMCID: PMC8596897 DOI: 10.1186/s12864-021-08082-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/09/2021] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND SNP arrays and short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well known, but not all of them are thoroughly quantified. RESULTS We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then performed cross-technology comparisons regarding their ability to call CNVs. Unlike other studies, we refrained from using a gold standard. Instead, we attempted to validate the CNV calls by the raw data of each technology. CONCLUSIONS Our study confirms that long-read platforms enable recovering CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV by different pipelines within each technology is strongly linked to other CNV evidence measures. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on what technology the database was built on.
|
15
|
Zhang YZ, Imoto S, Miyano S, Yamaguchi R. Enhancing breakpoint resolution with deep segmentation model: A general refinement method for read-depth based structural variant callers. PLoS Comput Biol 2021; 17:e1009186. [PMID: 34634042 PMCID: PMC8504719 DOI: 10.1371/journal.pcbi.1009186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2020] [Accepted: 06/15/2021] [Indexed: 11/30/2022] Open
Abstract
Read depths (RDs) are frequently used to identify structural variants (SVs) from sequencing data. Existing RD-based SV callers struggle to determine breakpoints at single-nucleotide resolution because RD data are noisy and the calculation is bin-based. In this paper, we propose to use the deep segmentation model UNet to learn base-wise RD patterns surrounding breakpoints of known SVs. We integrate model predictions with an RD-based SV caller to refine breakpoints to single-nucleotide resolution. We show that UNet can be trained with a small amount of data and can be applied both in-sample and cross-sample. An enhancement pipeline named RDBKE significantly increases the number of SVs with more precise breakpoints on simulated and real data. The source code of RDBKE is freely available at https://github.com/yaozhong/deepIntraSV.
Affiliation(s)
- Yao-Zhong Zhang
- Division of Health Medical Intelligence, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
| | - Seiya Imoto
- Division of Health Medical Intelligence, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
| | - Satoru Miyano
- Division of Health Medical Intelligence, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
| | - Rui Yamaguchi
- Division of Health Medical Intelligence, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, Nagoya, Japan
- Division of Cancer Informatics, Nagoya University Graduate School of Medicine, Nagoya, Japan
|
16
|
Wold J, Koepfli KP, Galla SJ, Eccles D, Hogg CJ, Le Lec MF, Guhlin J, Santure AW, Steeves TE. Expanding the conservation genomics toolbox: Incorporating structural variants to enhance genomic studies for species of conservation concern. Mol Ecol 2021; 30:5949-5965. [PMID: 34424587 PMCID: PMC9290615 DOI: 10.1111/mec.16141] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 01/20/2021] [Revised: 07/28/2021] [Accepted: 08/18/2021] [Indexed: 12/28/2022]
Abstract
Structural variants (SVs) are large rearrangements (>50 bp) within the genome that impact gene function and the content and structure of chromosomes. As a result, SVs are a significant source of functional genomic variation, that is, variation at genomic regions underpinning phenotype differences, that can have large effects on individual and population fitness. While there are increasing opportunities to investigate functional genomic variation in threatened species via single nucleotide polymorphism (SNP) data sets, SVs remain understudied despite their potential influence on fitness traits of conservation interest. In this future-focused Opinion, we contend that characterizing SVs offers the conservation genomics community an exciting opportunity to complement SNP-based approaches to enhance species recovery. We also leverage the existing literature (predominantly in human health, agriculture and ecoevolutionary biology) to identify approaches for readily characterizing SVs and consider how integrating these into the conservation genomics toolbox may transform the way we manage some of the world's most threatened species.
Affiliation(s)
- Jana Wold
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
| | - Klaus-Peter Koepfli
- Smithsonian-Mason School of Conservation, Front Royal, Virginia, USA; Centre for Species Survival, Smithsonian Conservation Biology Institute, National Zoological Park, Washington, District of Columbia, USA; Computer Technologies Laboratory, ITMO University, Saint Petersburg, Russia
| | - Stephanie J Galla
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand; Department of Biological Sciences, Boise State University, Boise, Idaho, USA
| | - David Eccles
- Malaghan Institute of Medical Research, Wellington, New Zealand
| | - Carolyn J Hogg
- School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW, Australia
| | - Marissa F Le Lec
- Department of Biochemistry, University of Otago, Dunedin, Otago, New Zealand
| | - Joseph Guhlin
- Department of Biochemistry, University of Otago, Dunedin, Otago, New Zealand; Genomics Aotearoa, Dunedin, Otago, New Zealand
| | - Anna W Santure
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand
| | - Tammy E Steeves
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
|
17
|
Benchmarking germline CNV calling tools from exome sequencing data. Sci Rep 2021; 11:14416. [PMID: 34257369 PMCID: PMC8277855 DOI: 10.1038/s41598-021-93878-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 12/15/2020] [Accepted: 06/29/2021] [Indexed: 02/06/2023] Open
Abstract
Whole-exome sequencing is an attractive alternative to microarray analysis because of the low cost and potential ability to detect copy number variations (CNV) of various sizes (from 1-2 exons to several Mb). Previous comparisons of the most popular CNV calling tools showed a high proportion of false-positive calls. Moreover, owing to the lack of a gold-standard CNV set, those results are limited and not comparable. Here, we aimed to perform a comprehensive analysis of the currently available tools capable of germline CNV calling using a single CNV standard and reference sample set. Compiling variants from previous studies with a Bayesian estimation approach, we constructed an internal standard for the NA12878 sample (pilot National Institute of Standards and Technology Reference Material) including 110,050 CNV or non-CNV exons. The standard was used to evaluate the performance of 16 germline CNV calling tools on the NA12878 sample and 10 correlated exomes as a reference set with respect to length distribution, concordance, and efficiency. Each algorithm had a certain range of detected lengths and showed low concordance with other tools. Most tools are focused on detection of a limited number of CNVs one to seven exons long with a false-positive rate below 50%. EXCAVATOR2, exomeCopy, and FishingCNV focused on detection of a wide range of variations but showed low precision. Upon unified comparison, the tools were not equivalent. The analysis performed allows choosing algorithms or ensembles of algorithms most suitable for a specific goal, e.g. population studies or medical genetics.
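Scoring a caller against an exon-level standard of this kind can be sketched as follows. The abstract does not give the exact metric definitions used in the study, so this example assumes the usual precision, recall and F1 formulas, and assumes the standard is a mapping from exon identifier to a CNV or non-CNV label. Exons absent from the standard are ignored. The function name `evaluate` and the exon identifiers are hypothetical.

```python
def evaluate(standard, called):
    """Score a set of called exons against an exon-level truth table.

    standard: dict mapping exon id -> True (CNV exon) or False (non-CNV exon)
    called:   set of exon ids that a tool reported as CNV
    Exons missing from `standard` are not counted either way (an assumption).
    """
    tp = sum(1 for exon, is_cnv in standard.items() if is_cnv and exon in called)
    fp = sum(1 for exon, is_cnv in standard.items() if not is_cnv and exon in called)
    fn = sum(1 for exon, is_cnv in standard.items() if is_cnv and exon not in called)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy truth table: two CNV exons and two non-CNV exons.
standard = {"e1": True, "e2": True, "e3": False, "e4": False}
# "e9" is not in the standard, so it does not affect the score.
called = {"e1", "e3", "e9"}
scores = evaluate(standard, called)
print(scores)
```

Fixing one standard and one scoring function for every tool is what makes the results comparable. With different truth sets per study, differences in precision could reflect the standard as much as the tool.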
|
18
|
Linderman MD, Paudyal C, Shakeel M, Kelley W, Bashir A, Gelb BD. NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data. Gigascience 2021; 10:giab046. [PMID: 34195837 PMCID: PMC8246072 DOI: 10.1093/gigascience/giab046] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2020] [Revised: 05/04/2021] [Accepted: 06/07/2021] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases. RESULTS We introduce NPSV, a machine learning-based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. CONCLUSIONS Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a "black box" that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications.
Affiliation(s)
- Michael D Linderman
- Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
| | - Crystal Paudyal
- Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
| | - Musab Shakeel
- Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
| | - William Kelley
- Department of Computer Science, Middlebury College, 14 Old Chapel Road, Middlebury, VT 05753, USA
| | - Ali Bashir
- Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
| | - Bruce D Gelb
- Mindich Child Health and Development Institute and the Departments of Pediatrics and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave Levy Place, Box 1040, New York, NY 10029, USA
|
19
|
Raeisi Dehkordi S, Luebeck J, Bafna V. FaNDOM: Fast nested distance-based seeding of optical maps. Patterns (N Y) 2021; 2:100248. [PMID: 34027500 PMCID: PMC8134938 DOI: 10.1016/j.patter.2021.100248] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/20/2021] [Revised: 03/08/2021] [Accepted: 04/01/2021] [Indexed: 12/25/2022]
Abstract
Optical mapping (OM) provides single-molecule readouts of fluorescently labeled sequence motifs on long fragments of DNA, resolved to nucleotide-level coordinates. With the advent of microfluidic technologies for analysis of DNA molecules, it is possible to inexpensively generate long OM data (>150 kbp) at high coverage. In addition to scaffolding for de novo assembly, OM data can be aligned to a reference genome for identification of genomic structural variants. We introduce FaNDOM (Fast Nested Distance Seeding of Optical Maps), an optical map alignment tool that greatly reduces the search space of the alignment process. On four benchmark human datasets, FaNDOM was significantly (4-14×) faster than competing tools while maintaining comparable sensitivity and specificity. We used FaNDOM to map variants in three cancer cell lines and identified many biologically interesting structural variants, including deletions, duplications, gene fusions and gene-disrupting rearrangements. FaNDOM is publicly available at https://github.com/jluebeck/FaNDOM.
Affiliation(s)
- Siavash Raeisi Dehkordi
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
| | - Jens Luebeck
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
- Bioinformatics & Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA 92093, USA
|
20
|
Moreno-Cabrera JM, Del Valle J, Castellanos E, Feliubadaló L, Pineda M, Serra E, Capellá G, Lázaro C, Gel B. CNVfilteR: an R/bioconductor package to identify false positives produced by germline NGS CNV detection tools. Bioinformatics 2021; 37:4227-4229. [PMID: 33983414 PMCID: PMC9502136 DOI: 10.1093/bioinformatics/btab356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2020] [Revised: 03/06/2021] [Accepted: 05/12/2021] [Indexed: 11/14/2022] Open
Abstract
Germline copy-number variants (CNVs) are relevant mutations for multiple genetics fields, such as the study of hereditary diseases. However, available benchmarks show that all next-generation sequencing (NGS) CNV calling tools produce false positives. We developed CNVfilteR, an R package that uses the single nucleotide variant calls usually obtained in germline NGS pipelines to identify those false positives. The package can detect both false deletions and false duplications. We evaluated CNVfilteR performance on callsets generated by 13 CNV calling tools on 3 whole-genome sequencing and 541 panel samples, showing a decrease of up to 44.8% in false positives and consistent F1-score increase. Using CNVfilteR to detect false-positive calls can improve the overall performance of existing CNV calling pipelines. AVAILABILITY CNVfilteR is released under Artistic-2.0 License. Source code and documentation are freely available at Bioconductor (http://www.bioconductor.org/packages/CNVfilteR). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
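The reasoning behind using single nucleotide variant calls to flag false deletions can be sketched in a few lines. In a true heterozygous deletion only one copy of the region remains, so variants inside it should not appear heterozygous. Several variants with allele fractions near 0.5 therefore argue against the deletion. The code below is a simplified illustration of that idea only. It is written in Python rather than R, it is not the CNVfilteR API, and every threshold and function name in it is a hypothetical choice.

```python
def heterozygous_fraction(allele_fractions, low=0.3, high=0.7):
    """Fraction of SNVs whose allele fraction falls in a heterozygous-looking band.

    The band limits `low` and `high` are assumptions made for this sketch.
    """
    if not allele_fractions:
        return 0.0
    het = sum(1 for af in allele_fractions if low <= af <= high)
    return het / len(allele_fractions)


def deletion_looks_false(allele_fractions, min_snvs=3, max_het_fraction=0.3):
    """Flag a called deletion as a likely false positive.

    allele_fractions: variant allele fractions of SNVs located inside the call.
    With fewer than `min_snvs` variants there is too little evidence, so the
    call is left alone. Both parameters are hypothetical.
    """
    if len(allele_fractions) < min_snvs:
        return False
    return heterozygous_fraction(allele_fractions) > max_het_fraction


# Three of four SNVs look heterozygous: hard to reconcile with a one-copy region.
print(deletion_looks_false([0.48, 0.52, 0.45, 0.98]))
# All SNVs look homozygous: consistent with a real deletion.
print(deletion_looks_false([0.99, 1.0, 0.97]))
```

Duplications need a different rule, since a three-copy region shifts heterozygous allele fractions toward roughly one third or two thirds instead of removing heterozygosity. That case is not covered by this sketch.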
Affiliation(s)
- José Marcos Moreno-Cabrera
- Hereditary Cancer Group, Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Campus Can Ruti, Badalona, Barcelona, Spain; Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Jesús Del Valle
- Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Elisabeth Castellanos
- Hereditary Cancer Group, Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Campus Can Ruti, Badalona, Barcelona, Spain; Clinical Genomics Unit, Clinical Genetics Service, Northern Metropolitan Clinical Laboratory, Germans Trias i Pujol University Hospital (HUGTiP), Campus Can Ruti, Badalona, Barcelona, Spain
| | - Lidia Feliubadaló
- Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Marta Pineda
- Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Eduard Serra
- Hereditary Cancer Group, Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Campus Can Ruti, Badalona, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Gabriel Capellá
- Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Conxi Lázaro
- Hereditary Cancer Program, Joint Program on Hereditary Cancer, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge-IDIBELL, L'Hospitalet de Llobregat, Barcelona, Spain; Instituto de Salud Carlos III, Centro de Investigación Biomédica en Red Cáncer (CIBERONC), Madrid, Spain
| | - Bernat Gel
- Hereditary Cancer Group, Program for Predictive and Personalized Medicine of Cancer, Germans Trias i Pujol Research Institute (PMPPC-IGTP), Campus Can Ruti, Badalona, Barcelona, Spain
|
21
|
Gu W, Zhou A, Wang L, Sun S, Cui X, Zhu D. SVLR: Genome Structural Variant Detection Using Long-Read Sequencing Data. J Comput Biol 2021; 28:774-788. [PMID: 33973820 DOI: 10.1089/cmb.2021.0048] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/11/2023] Open
Abstract
Genome structural variants (SVs) have great impacts on human phenotype and diversity, and have been linked to numerous diseases. Long-read sequencing technologies make it possible to find SVs as long as 10,000 nucleotides. Long read-based SV detection has therefore drawn the attention of many recent research projects, and many tools for detecting SVs from long reads have been developed. In this article, we present a new method, called SVLR, to detect SVs based on long-read sequencing data. Compared with existing methods, SVLR can detect three new kinds of SVs: block replacements, block interchanges, and translocations. Although these new SVs are structurally more complicated, SVLR achieves accuracies that are comparable with those of the classic SVs. Moreover, for the classic SVs that can be detected by state-of-the-art methods (e.g., SVIM and Sniffles), our experiments demonstrate recall improvements of up to 38% without harming precision (i.e., >78%). We also point out three directions to further improve SV detection in the future. Source code: https://github.com/GWYSDU/SVLR.
Affiliation(s)
- Wenyan Gu
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Aizhong Zhou
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Shiwei Sun
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Xuefeng Cui
- School of Computer Science and Technology, Shandong University, Qingdao, China
| | - Daming Zhu
- School of Computer Science and Technology, Shandong University, Qingdao, China
|
22
|
Jones W, Gong B, Novoradovskaya N, Li D, Kusko R, Richmond TA, Johann DJ, Bisgin H, Sahraeian SME, Bushel PR, Pirooznia M, Wilkins K, Chierici M, Bao W, Basehore LS, Lucas AB, Burgess D, Butler DJ, Cawley S, Chang CJ, Chen G, Chen T, Chen YC, Craig DJ, Del Pozo A, Foox J, Francescatto M, Fu Y, Furlanello C, Giorda K, Grist KP, Guan M, Hao Y, Happe S, Hariani G, Haseley N, Jasper J, Jurman G, Kreil DP, Łabaj P, Lai K, Li J, Li QZ, Li Y, Li Z, Liu Z, López MS, Miclaus K, Miller R, Mittal VK, Mohiyuddin M, Pabón-Peña C, Parsons BL, Qiu F, Scherer A, Shi T, Stiegelmeyer S, Suo C, Tom N, Wang D, Wen Z, Wu L, Xiao W, Xu C, Yu Y, Zhang J, Zhang Y, Zhang Z, Zheng Y, Mason CE, Willey JC, Tong W, Shi L, Xu J. A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency. Genome Biol 2021; 22:111. [PMID: 33863366 PMCID: PMC8051128 DOI: 10.1186/s13059-021-02316-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 07/27/2020] [Accepted: 03/18/2021] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. RESULTS In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency with more than 25,000 variants having less than 20% allele frequency with 1653 variants in COSMIC-related genes. This is 5-100× more than existing commercially available samples. We also identify an unprecedented number of negative positions in coding regions, allowing statistical rigor in assessing limit-of-detection, sensitivity, and precision. Over 300 loci are randomly selected and independently verified via droplet digital PCR with 100% concordance. Agilent normal reference Sample B can be admixed with Sample A to create new samples with a similar number of known variants at much lower allele frequency than what exists in Sample A natively, including known variants having allele frequency of 0.02%, a range suitable for assessing liquid biopsy panels. CONCLUSION These new reference samples and their admixtures provide superior capability for performing oncopanel quality control, analytical accuracy, and validation for small to large oncopanels and liquid biopsy assays.
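The effect of admixing a variant-rich sample with a normal sample can be worked through with simple dilution arithmetic. The sketch below rests on two simplifying assumptions that the abstract does not state: the variant is absent from Sample B, and the mixing fraction is expressed as the share of genome copies contributed by Sample A at that locus. Cancer cell lines are often aneuploid, so real mixtures would deviate from this idealized model. Both function names are hypothetical.

```python
def admixed_allele_frequency(af_in_a, fraction_a):
    """Expected allele frequency after mixing Sample A into Sample B.

    af_in_a:    allele frequency of the variant in Sample A (0..1)
    fraction_a: share of genome copies in the mixture that come from Sample A
    Assumes the variant is absent from Sample B.
    """
    if not 0.0 <= fraction_a <= 1.0:
        raise ValueError("fraction_a must lie between 0 and 1")
    return af_in_a * fraction_a


def fraction_needed(af_in_a, target_af):
    """Share of Sample A required to dilute a variant down to `target_af`."""
    if target_af > af_in_a:
        raise ValueError("dilution cannot raise the allele frequency")
    return target_af / af_in_a


# A variant at 1% in Sample A reaches 0.02% when A makes up about 2% of the mix,
# that is, roughly 1 part Sample A to 49 parts Sample B.
share = fraction_needed(0.01, 0.0002)
print(share, admixed_allele_frequency(0.01, share))
```

The same arithmetic shows why the variant count is preserved under admixture: every known variant in Sample A is scaled by the same factor, so the mixture carries the same positions at proportionally lower frequencies.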
Affiliation(s)
- Wendell Jones
- Q2 Solutions - EA Genomics, 5927 S Miami Blvd., Morrisville, NC, 27560, USA
| | - Binsheng Gong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | | | - Dan Li
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Rebecca Kusko
- Immuneering Corporation, One Broadway, 14th Floor, Cambridge, MA, 02142, USA
| | - Todd A Richmond
- Market & Application Development Bioinformatics, Roche Sequencing Solutions Inc., 4300 Hacienda Dr., Pleasanton, CA, 94588, USA
| | - Donald J Johann
- Winthrop P Rockefeller Cancer Institute, University of Arkansas for Medical Sciences, 4301 W Markham St., Little Rock, AR, 72205, USA
| | - Halil Bisgin
- Department of Computer Science, Engineering and Physics, University of Michigan-Flint, Flint, MI, 48502, USA
| | - Sayed Mohammad Ebrahim Sahraeian
- Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., 1301 Shoreway Rd., Suite 7 #300, Belmont, CA, 94002, USA
| | - Pierre R Bushel
- National Institute of Environmental Health Sciences, Research Triangle Park, Durham, NC, 27709, USA
| | - Mehdi Pirooznia
- Bioinformatics and Computational Biology Laboratory, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Katherine Wilkins
- Agilent Technologies, 5301 Stevens Creek Blvd., Santa Clara, CA, 95051, USA
| | | | - Wenjun Bao
- JMP Life Sciences, SAS Institute Inc., Cary, NC, 27519, USA
| | - Lee Scott Basehore
- Agilent Technologies, 11011 N Torrey Pines Rd., La Jolla, CA, 92037, USA
| | | | - Daniel Burgess
- (formerly) Research and Development, Roche Sequencing Solutions Inc., 500 South Rosa Rd., Madison, WI, 53719, USA
| | - Daniel J Butler
- Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY, 10065, USA
| | - Simon Cawley
- (formerly) Clinical Sequencing Division, Thermo Fisher Scientific, 180 Oyster Point Blvd., South San Francisco, CA, 94080, USA
| | - Chia-Jung Chang
- Stanford Genome Technology Center, Stanford University, Palo Alto, CA, 94304, USA
| | - Guangchun Chen
- Department of Immunology, Genomics and Microarray Core Facility, University of Texas Southwestern Medical Center, 5323 Harry Hine Blvd., Dallas, TX, 75390, USA
| | - Tao Chen
- University of Texas Southwestern Medical Center, 2330 Inwood Rd., Dallas, TX, 75390, USA
| | - Yun-Ching Chen
- Bioinformatics and Computational Biology Laboratory, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Daniel J Craig
- Department of Medicine, College of Medicine and Life Sciences, The University of Toledo, Toledo, OH, 43614, USA
| | - Angela Del Pozo
- Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz, CIBERER Instituto de Salud Carlos III, 28046, Madrid, Spain
| | - Jonathan Foox
- Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY, 10065, USA
| | | | - Yutao Fu
- Thermo Fisher Scientific, 110 Miller Ave., Ann Arbor, MI, 48104, USA
| | | | - Kristina Giorda
- Marketing, Integrated DNA Technologies, Inc., 1710 Commercial Park, Coralville, IA, 52241, USA
| | - Kira P Grist
- Q2 Solutions - EA Genomics, 5927 S Miami Blvd., Morrisville, NC, 27560, USA
| | - Meijian Guan
- JMP Life Sciences, SAS Institute Inc., Cary, NC, 27519, USA
| | - Yingyi Hao
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, China
| | - Scott Happe
- Agilent Technologies, 1834 State Hwy 71 West, Cedar Creek, TX, 78612, USA
| | - Gunjan Hariani
- Q2 Solutions - EA Genomics, 5927 S Miami Blvd., Morrisville, NC, 27560, USA
| | - Nathan Haseley
- Illumina Inc., 5200 Illumina Way, San Diego, CA, 92122, USA
| | - Jeff Jasper
- Q2 Solutions - EA Genomics, 5927 S Miami Blvd., Morrisville, NC, 27560, USA
| | | | - David Philip Kreil
- Bioinformatics Research, Institute of Molecular Biotechnology, Boku University Vienna, Vienna, Austria
| | - Paweł Łabaj
- Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- Department of Biotechnology, Boku University, Vienna, Austria
| | - Kevin Lai
- Bioinformatics, Integrated DNA Technologies, Inc., 1710 Commercial Park, Coralville, IA, 52241, USA
| | - Jianying Li
- Kelly Government Solutions, Inc., Research Triangle Park, NC, 27709, USA
| | - Quan-Zhen Li
- Department of Immunology, Genomics and Microarray Core Facility, University of Texas Southwestern Medical Center, 5323 Harry Hine Blvd., Dallas, TX, 75390, USA
| | - Yulong Li
- Center of Genome and Personalized Medicine, Institute of Cancer Stem Cell, Dalian Medical University, Dalian, Liaoning, China
| | - Zhiguang Li
- Center of Genome and Personalized Medicine, Institute of Cancer Stem Cell, Dalian Medical University, Dalian, Liaoning, China
| | - Zhichao Liu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Mario Solís López
- Institute of Medical and Molecular Genetics (INGEMM), Hospital Universitario La Paz, CIBERER Instituto de Salud Carlos III, 28046, Madrid, Spain
- EATRIS ERIC- European Infrastructure for Translational Medicine, De Boelelaan 1118, 1081, HZ, Amsterdam, The Netherlands
| | - Kelci Miclaus
- JMP Life Sciences, SAS Institute Inc., Cary, NC, 27519, USA
| | - Raymond Miller
- Agilent Technologies, 5301 Stevens Creek Blvd., Santa Clara, CA, 95051, USA
| | - Vinay K Mittal
- Thermo Fisher Scientific, 110 Miller Ave., Ann Arbor, MI, 48104, USA
| | - Marghoob Mohiyuddin
- Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., 1301 Shoreway Rd., Suite 7 #300, Belmont, CA, 94002, USA
| | - Carlos Pabón-Peña
- Agilent Technologies, 5301 Stevens Creek Blvd., Santa Clara, CA, 95051, USA
| | - Barbara L Parsons
- Division of Genetic and Molecular Toxicology, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Fujun Qiu
- Research and Development, Burning Rock Biotech, Shanghai, 201114, China
| | - Andreas Scherer
- EATRIS ERIC- European Infrastructure for Translational Medicine, De Boelelaan 1118, 1081, HZ, Amsterdam, The Netherlands
- Institute for Molecular Medicine Finland (FIMM), Nordic EMBL Partnership for Molecular Medicine, HiLIFE Unit, Biomedicum Helsinki 2U (D302b), FI-00014 University of Helsinki, P.O. Box 20 (Tukholmankatu 8), Helsinki, Finland
| | - Tieliu Shi
- Center for Bioinformatics and Computational Biology, and the Institute of Biomedical Sciences, School of Life Sciences, East China Normal University, 500 Dongchuan Rd, Shanghai, 200241, China
| | - Suzy Stiegelmeyer
- University of North Carolina Health, 101 Manning Drive, Chapel Hill, NC, 27514, USA
| | - Chen Suo
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
| | - Nikola Tom
- EATRIS ERIC- European Infrastructure for Translational Medicine, De Boelelaan 1118, 1081, HZ, Amsterdam, The Netherlands
- Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Kamenice 5, 625 00, Brno, Czech Republic
| | - Dong Wang
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Zhining Wen
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, China
| | - Leihong Wu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Wenzhong Xiao
- Stanford Genome Technology Center, Stanford University, Palo Alto, CA, 94304, USA
- Massachusetts General Hospital, Harvard Medical School, Boston, MA, 02114, USA
| | - Chang Xu
- Research and Development, QIAGEN Sciences Inc., Frederick, MD, 21703, USA
| | - Ying Yu
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Shanghai Cancer Hospital/Cancer Institute, Fudan University, Shanghai, 200438, China
| | - Jiyang Zhang
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Shanghai Cancer Hospital/Cancer Institute, Fudan University, Shanghai, 200438, China
| | - Yifan Zhang
- University of Arkansas at Little Rock, Little Rock, AR, 72204, USA
| | - Zhihong Zhang
- Research and Development, Burning Rock Biotech, Shanghai, 201114, China
| | - Yuanting Zheng
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Shanghai Cancer Hospital/Cancer Institute, Fudan University, Shanghai, 200438, China
| | - Christopher E Mason
- Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY, 10065, USA
| | - James C Willey
- Departments of Medicine, Pathology, and Cancer Biology, College of Medicine and Life Sciences, University of Toledo Health Sciences Campus, 3000 Arlington Ave, Toledo, OH, 43614, USA
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Leming Shi
- State Key Laboratory of Genetic Engineering, School of Life Sciences and Shanghai Cancer Hospital/Cancer Institute, Fudan University, Shanghai, 200438, China
- Human Phenome Institute, Fudan University, Shanghai, 201203, China
- Fudan-Gospel Joint Research Center for Precision Medicine, Fudan University, Shanghai, 200438, China
| | - Joshua Xu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, 72079, USA.
| |
|
23
|
Sun Y, Liu F, Fan C, Wang Y, Song L, Fang Z, Han R, Wang Z, Wang X, Yang Z, Xu Z, Peng J, Shi C, Zhang H, Dong W, Huang H, Li Y, Le Y, Sun J, Peng Z. Characterizing sensitivity and coverage of clinical WGS as a diagnostic test for genetic disorders. BMC Med Genomics 2021; 14:102. [PMID: 33849535 PMCID: PMC8045368 DOI: 10.1186/s12920-021-00948-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/20/2020] [Accepted: 03/31/2021] [Indexed: 12/30/2022]
Abstract
Background: Due to its reduced cost and incomparable advantages, WGS is likely to lead to changes in clinical diagnosis of rare and undiagnosed diseases. However, the sensitivity and breadth of coverage of clinical WGS as a diagnostic test for genetic disorders have not been fully evaluated.
Methods: Here, the performance of WGS in NA12878, the YH cell line, and the Chinese trios was measured by assessing sensitivity, PPV, and depth and breadth of coverage using MGISEQ-2000. We also compared the performance of WES and WGS using NA12878. The sensitivity and PPV were tested using the family-based trio design for the Chinese trios. We further developed a systematic WGS pipeline for the analysis of 8 clinical cases.
Results: In general, the sensitivity and PPV for SNV/indel detection increased with mean depth and reached a plateau at a mean depth of ~40X in down-sampled NA12878 data. With a mean depth of 40X, the sensitivity for homozygous and heterozygous SNPs in NA12878 was >99.25% and >99.50%, respectively, and the PPV was 99.97% and 98.96%. Homozygous and heterozygous indels showed lower sensitivity and PPV. The sensitivity and PPV were still not 100% even with a mean depth of ~150X. We also observed substantial variation in the sensitivity of CNV detection across different tools, especially for CNVs smaller than 1 kb. In general, the breadth of coverage for disease-associated genes and CNVs increased with mean depth. The sensitivity and coverage of WGS (~40X) were better than those of WES (~120X). Among the Chinese trios with a mean depth of ~40X, the sensitivity among offspring was >99.48% and >96.36% for SNP and indel detection, respectively, and the PPVs were 99.86% and 97.93%. All 12 previously validated variants in the 8 clinical cases were successfully detected using our WGS pipeline.
Conclusions: The current standard of a mean depth of 40X may be sufficient for SNV/indel detection and identification of most CNVs. It would be advisable for clinical scientists to determine the range of sensitivity and PPV for different classes of variants for a particular WGS pipeline, which would be useful when interpreting and delivering clinical reports.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12920-021-00948-5.
Affiliation(s)
- Yan Sun
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Fengxia Liu
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Chunna Fan
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Yaoshen Wang
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Lijie Song
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Zhonghai Fang
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Rui Han
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Zhonghua Wang
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Xiaodan Wang
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Ziying Yang
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Zhenpeng Xu
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Jiguang Peng
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Chaonan Shi
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | | | - Wei Dong
- BGI-Beijing Clinical Laboratories, BGI-Shenzhen, Beijing, 101300, China
| | - Hui Huang
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Yun Li
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China
| | - Yanqun Le
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Jun Sun
- Tianjin Medical Laboratory, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
- Binhai Genomics Institute, BGI-Tianjin, BGI-Shenzhen, Tianjin, 300308, China
| | - Zhiyu Peng
- BGI Genomics, BGI-Shenzhen, Shenzhen, 518083, China.
| |
|
24
|
Robust Benchmark Structural Variant Calls of An Asian Using the State-of-art Long Fragment Sequencing Technologies. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 20:192-204. [PMID: 33662625 PMCID: PMC9510867 DOI: 10.1016/j.gpb.2020.10.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/22/2020] [Revised: 09/17/2020] [Accepted: 12/26/2020] [Indexed: 12/12/2022]
Abstract
The importance of structural variants (SVs) for human phenotypes and diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performances in biological and clinical research. To facilitate the validation and application of these SV detection approaches, we established an Asian reference material by characterizing the genome of an Epstein-Barr virus (EBV)-immortalized B lymphocyte line along with identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8938 SVs by integrating four alignment-based SV callers, including 109× Pacific Biosciences (PacBio) continuous long reads (CLRs), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore Technologies (ONT) long reads, and 114× Bionano optical mapping platform, and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR amplification and Sanger sequencing, demonstrating the robustness of our SV calls. Combining trio-binning-based haplotype assemblies, we established an SV benchmark for identifying false negatives and false positives by constructing the continuous high-confidence regions (CHCRs), which covered 1.46 gigabase pairs (Gb) and 6882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical research.
|
25
|
Minoche AE, Lundie B, Peters GB, Ohnesorg T, Pinese M, Thomas DM, Zankl A, Roscioli T, Schonrock N, Kummerfeld S, Burnett L, Dinger ME, Cowley MJ. ClinSV: clinical grade structural and copy number variant detection from whole genome sequencing data. Genome Med 2021; 13:32. [PMID: 33632298 PMCID: PMC7908648 DOI: 10.1186/s13073-021-00841-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 06/30/2020] [Accepted: 02/02/2021] [Indexed: 01/09/2023]
Abstract
Whole genome sequencing (WGS) has the potential to outperform clinical microarrays for the detection of structural variants (SVs), including copy number variants (CNVs), but has been challenged by high false positive rates. Here we present ClinSV, a WGS-based SV integration, annotation, prioritization, and visualization framework, which identified 99.8% of simulated pathogenic ClinVar CNVs > 10 kb and 11/11 pathogenic variants from matched microarrays. The false positive rate was low (1.5-4.5%) and reproducibility high (95-99%). In clinical practice, ClinSV identified reportable variants in 22 of 485 patients (4.7%), of which 35-63% were not detectable by current clinical microarray designs. ClinSV is available at https://github.com/KCCG/ClinSV.
Affiliation(s)
- Andre E Minoche
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia.
- St Vincent's Clinical School, UNSW, Sydney, NSW, Australia.
| | - Ben Lundie
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
| | - Greg B Peters
- Sydney Genome Diagnostics, The Children's Hospital at Westmead, Hawkesbury Road & Hainsworth Street, Westmead, NSW, Australia
| | - Thomas Ohnesorg
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- Genome.One, Darlinghurst, NSW, Australia
| | - Mark Pinese
- Children's Cancer Institute, University of New South Wales, Randwick, Sydney, NSW, Australia
- School of Women's and Children's Health, UNSW, Sydney, NSW, Australia
| | - David M Thomas
- St Vincent's Clinical School, UNSW, Sydney, NSW, Australia
- The Kinghorn Cancer Centre and Cancer Division, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
| | - Andreas Zankl
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- Department of Clinical Genetics, The Children's Hospital at Westmead, Hawkesbury Road, Westmead, NSW, Australia
- Sydney Medical School, The University of Sydney, Camperdown, NSW, Australia
| | - Tony Roscioli
- NSW Health Pathology Randwick, Sydney, NSW, Australia
- Centre for Clinical Genetics, Sydney Children's Hospital, Randwick, NSW, Australia
- Prince of Wales Clinical School, University of New South Wales, Sydney, NSW, Australia
- Neuroscience Research Australia, University of New South Wales, Randwick, Sydney, NSW, Australia
| | - Nicole Schonrock
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- Genome.One, Darlinghurst, NSW, Australia
| | - Sarah Kummerfeld
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- St Vincent's Clinical School, UNSW, Sydney, NSW, Australia
| | - Leslie Burnett
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- St Vincent's Clinical School, UNSW, Sydney, NSW, Australia
- Genome.One, Darlinghurst, NSW, Australia
- Sydney Medical School, The University of Sydney, Camperdown, NSW, Australia
| | - Marcel E Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW, Australia
| | - Mark J Cowley
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, 370 Victoria Street, Darlinghurst, NSW, Australia.
- St Vincent's Clinical School, UNSW, Sydney, NSW, Australia.
- Children's Cancer Institute, University of New South Wales, Randwick, Sydney, NSW, Australia.
- School of Women's and Children's Health, UNSW, Sydney, NSW, Australia.
| |
|
26
|
Krishnan V, Utiramerur S, Ng Z, Datta S, Snyder MP, Ashley EA. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 2021; 22:85. [PMID: 33627090 PMCID: PMC7903625 DOI: 10.1186/s12859-020-03934-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/21/2019] [Accepted: 12/15/2020] [Indexed: 01/22/2023]
Abstract
BACKGROUND: Benchmarking the performance of complex analytical pipelines is an essential part of developing Lab Developed Tests (LDTs). Reference samples and benchmark calls published by the Genome in a Bottle (GIAB) consortium have enabled the evaluation of analytical methods. The performance of such methods is not uniform across the different genomic regions of interest and variant types. Several benchmarking methods such as hap.py, vcfeval, and vcflib are available to assess the analytical performance characteristics of variant calling algorithms. However, assessing the performance characteristics of an overall LDT assay still requires stringing together several such methods and relying on experienced bioinformaticians to interpret the results. In addition, these methods depend on the hardware, operating system, and other software libraries, making it impossible to reliably repeat the analytical assessment when any underlying dependency of the assay changes. Here we present a scalable and reproducible cloud-based benchmarking workflow that is independent of the laboratory, the technician executing the workflow, and the underlying compute hardware, and that can be used to rapidly and continually assess the performance of LDT assays across their regions of interest and reportable range, using a broad set of benchmarking samples.
RESULTS: The benchmarking workflow was used to evaluate the performance characteristics of secondary analysis pipelines commonly used by clinical genomics laboratories in their LDT assays, such as the GATK HaplotypeCaller v3.7 and the SpeedSeq workflow based on FreeBayes v0.9.10. Five reference sample truth sets generated by the Genome in a Bottle (GIAB) consortium, six samples from the Personal Genome Project (PGP), and several samples with validated clinically relevant variants from the Centers for Disease Control were used in this work. The performance characteristics were evaluated and compared for multiple reportable ranges, such as the whole exome and the clinical exome.
CONCLUSIONS: We have implemented a benchmarking workflow for clinical diagnostic laboratories that generates metrics such as specificity, precision, and sensitivity for germline SNPs and InDels within a reportable range using whole exome or genome sequencing data. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish the performance of variant calling pipelines in a clinical setting.
Affiliation(s)
- Vandhana Krishnan
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Palo Alto, CA, USA
| | - Sowmithri Utiramerur
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Palo Alto, CA, USA
- Clinical Genomics Program, Stanford Health Care, Stanford, CA, USA
- Roche Diagnostics Solutions, Research and Early Development, Pleasanton, CA, USA
| | - Zena Ng
- Clinical Genomics Program, Stanford Health Care, Stanford, CA, USA
| | - Somalee Datta
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Palo Alto, CA, USA
- School of Medicine, Research IT - Technology and Digital Solutions, Stanford University, Redwood City, CA, USA
| | - Michael P Snyder
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Palo Alto, CA, USA
| | - Euan A Ashley
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
- Department of Cardiovascular Medicine, Stanford University, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| |
|
27
|
Lavrichenko K, Helgeland Ø, Njølstad PR, Jonassen I, Johansson S. SeeCiTe: a method to assess CNV calls from SNP arrays using trio data. Bioinformatics 2021; 37:1876-1883. [PMID: 33459766 PMCID: PMC8317106 DOI: 10.1093/bioinformatics/btab028] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/28/2020] [Revised: 12/17/2020] [Accepted: 01/11/2021] [Indexed: 11/15/2022]
Abstract
Motivation: Single nucleotide polymorphism (SNP) genotyping arrays remain an attractive platform for assaying copy number variants (CNVs) in large population-wide cohorts. However, current tools for calling CNVs are still prone to extensive false positive calls when applied to biobank-scale arrays. Moreover, there is a lack of methods exploiting cohorts with trios available (e.g. nuclear family) to assist in quality control and downstream analyses following the calling.
Results: We developed SeeCiTe (Seeing CNVs in Trios), a novel CNV quality control tool that postprocesses output from current CNV-calling tools, exploiting child-parent trio data to classify calls into quality categories and provide a set of visualizations for each putative CNV call in the offspring. We apply it to the Norwegian Mother, Father and Child Cohort Study (MoBa) and show that SeeCiTe improves the specificity and sensitivity compared to the common empiric filtering strategies. To our knowledge, it is the first tool that utilizes probe-level CNV data in trios (and singletons) to systematically highlight potential artifacts and visualize signal intensities in a streamlined fashion suitable for biobank-scale studies.
Availability and implementation: The software is implemented in R with the source code freely available at https://github.com/aksenia/SeeCiTe
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ksenia Lavrichenko
- Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Øyvind Helgeland
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
| | - Pål R Njølstad
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Department of Pediatrics and Adolescents, Haukeland University Hospital, Bergen, Norway
| | - Inge Jonassen
- Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
| | - Stefan Johansson
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway
| |
|
28
|
Bhuyan MSI, Pe'er I, Rahman MS. SICaRiO: short indel call filtering with boosting. Brief Bioinform 2020; 22:5917082. [PMID: 33003198 DOI: 10.1093/bib/bbaa238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2020] [Revised: 08/26/2020] [Accepted: 08/27/2020] [Indexed: 11/14/2022]
Abstract
Despite impressive improvements in next-generation sequencing technology, reliable detection of indels is still a difficult endeavour. Recognition of true indels is of prime importance in many applications, such as personalized health care, disease genomics and population genetics. Recently, advanced machine learning techniques have been successfully applied to classification problems with large-scale data. In this paper, we present SICaRiO, a gradient boosting classifier for the reliable detection of true indels, trained with the gold-standard dataset from the 'Genome in a Bottle' (GIAB) consortium. Our filtering scheme significantly improves the performance of each variant calling pipeline used in GIAB and beyond. SICaRiO uses genomic features that can be computed from publicly available resources, i.e. it does not require sequencing pipeline-specific information (e.g. read depth). This study also sheds light on prior genomic contexts responsible for the erroneous calling of indels made by sequencing pipelines. We have compared prediction difficulty for three categories of indels over different sequencing pipelines. We have also ranked genomic features according to their predictivity in determining false positives.
Affiliation(s)
- Md Shariful Islam Bhuyan
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Itsik Pe'er
- Department of Computer Science, Fu Foundation School of Engineering, and the Chair at the Center for Health Analytics, Data Science Institute, Columbia University, New York, USA
| | - M Sohel Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| |
|
29
|
Zhuang X, Ye R, So MT, Lam WY, Karim A, Yu M, Ngo ND, Cherny SS, Tam PKH, Garcia-Barcelo MM, Tang CSM, Sham PC. A random forest-based framework for genotyping and accuracy assessment of copy number variations. NAR Genom Bioinform 2020; 2:lqaa071. [PMID: 33575619 PMCID: PMC7671382 DOI: 10.1093/nargab/lqaa071] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/25/2020] [Revised: 08/18/2020] [Accepted: 08/26/2020] [Indexed: 12/24/2022]
Abstract
Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV2, particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV2 in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs.
Affiliation(s)
- Xuehan Zhuang
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Rui Ye
- Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Man-Ting So
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Wai-Yee Lam
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Anwarul Karim
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Michelle Yu
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Ngoc Diem Ngo
- National Hospital of Pediatrics, Ha Noi 100000, Vietnam
| | - Stacey S Cherny
- Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Paul Kwong-Hang Tam
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | | | - Clara Sze-Man Tang
- Department of Surgery, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| | - Pak Chung Sham
- Department of Psychiatry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
| |
|
30
|
Hayes M, Mullins D, Nguyen A. Complex Variant Discovery Using Discordant Cluster Normalization. J Comput Biol 2020; 28:185-194. [PMID: 32783649 DOI: 10.1089/cmb.2020.0249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Complex genomic structural variants (CGSVs) are abnormalities that present with three or more breakpoints, making their discovery a challenge. The majority of existing algorithms for structural variant detection are only designed to find simple structural variants (SSVs) such as deletions and inversions; they fail to find more complex events such as deletion-inversions or deletion-duplications, for example. In this study, we present an algorithm named CleanBreak that employs a clique partitioning graph-based strategy to identify collections of SSV clusters and then subsequently identifies overlapping SSV clusters to examine the search space of possible CGSVs, choosing the one that is most concordant with local read depth. We evaluated CleanBreak's performance on whole genome simulated data and a real data set from the 1000 Genomes Project. We also compared CleanBreak with another algorithm for CGSV discovery. The results demonstrate CleanBreak's utility as an effective method to discover CGSVs.
Affiliation(s)
- Matthew Hayes
- Department of Physics and Computer Science, Xavier University of Louisiana, New Orleans, Louisiana, USA
| | - Derrick Mullins
- Department of Physics and Computer Science, Xavier University of Louisiana, New Orleans, Louisiana, USA
| | - Angela Nguyen
- Department of Biology, Xavier University of Louisiana, New Orleans, Louisiana, USA
| |
|
31
|
A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020; 38:1347-1355. [PMID: 32541955 PMCID: PMC8454654 DOI: 10.1038/s41587-020-0538-8] [Citation(s) in RCA: 175] [Impact Index Per Article: 43.8] [Received: 07/16/2019] [Accepted: 04/28/2020] [Indexed: 12/19/2022]
Abstract
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution, and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed the first sequence-resolved benchmark set for identification of both false negative and false positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle (GIAB) Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12745 isolated, sequence-resolved insertion (7281) and deletion (5464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5262 insertions and 4095 deletions supported by ≥1 diploid assembly. We demonstrate the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked-, and long-read sequencing and optical mapping.
|
32
|
Jakubosky D, Smith EN, D'Antonio M, Jan Bonder M, Young Greenwald WW, D'Antonio-Chronowska A, Matsui H, Stegle O, Montgomery SB, DeBoever C, Frazer KA. Discovery and quality analysis of a comprehensive set of structural variants and short tandem repeats. Nat Commun 2020; 11:2928. [PMID: 32522985 PMCID: PMC7287045 DOI: 10.1038/s41467-020-16481-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 07/26/2019] [Accepted: 05/05/2020] [Indexed: 02/07/2023]
Abstract
Structural variants (SVs) and short tandem repeats (STRs) are important sources of genetic diversity but are not routinely analyzed in genetic studies because they are difficult to accurately identify and genotype. Because SVs and STRs range in size and type, it is necessary to apply multiple algorithms that incorporate different types of evidence from sequencing data and employ complex filtering strategies to discover a comprehensive set of high-quality and reproducible variants. Here we assemble a set of 719 deep whole genome sequencing (WGS) samples (mean 42×) from 477 distinct individuals which we use to discover and genotype a wide spectrum of SV and STR variants using five algorithms. We use 177 unique pairs of genetic replicates to identify factors that affect variant call reproducibility and develop a systematic filtering strategy to create one of the most complete and well-characterized maps of SVs and STRs to date.
Affiliation(s)
- David Jakubosky
- Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, 92093-0419, USA
- Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093-0419, USA
| | - Erin N Smith
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA
| | - Matteo D'Antonio
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Marc Jan Bonder
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - William W Young Greenwald
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
| | | | - Hiroko Matsui
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Oliver Stegle
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany
| | - Stephen B Montgomery
- Department of Pathology, Stanford University, Stanford, CA, 94305, USA
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
| | - Christopher DeBoever
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
| | - Kelly A Frazer
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA.
- Institute of Genomic Medicine, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA.
| |
|
33
|
Heller D, Vingron M. SVIM: structural variant identification using mapped long reads. Bioinformatics 2020; 35:2907-2915. [PMID: 30668829 PMCID: PMC6735718 DOI: 10.1093/bioinformatics/btz041] [Citation(s) in RCA: 154] [Impact Index Per Article: 38.5] [Received: 07/26/2018] [Revised: 01/04/2019] [Accepted: 01/22/2019] [Indexed: 02/07/2023]
Abstract
Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than single-nucleotide polymorphisms or small insertions and deletions. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from Pacific Biosciences and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Heller
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| |
|
34
|
Zhou W, Emery SB, Flasch DA, Wang Y, Kwan KY, Kidd JM, Moran JV, Mills RE. Identification and characterization of occult human-specific LINE-1 insertions using long-read sequencing technology. Nucleic Acids Res 2020; 48:1146-1163. [PMID: 31853540 PMCID: PMC7026601 DOI: 10.1093/nar/gkz1173] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 08/07/2019] [Revised: 11/14/2019] [Accepted: 12/05/2019] [Indexed: 11/13/2022]
Abstract
Long Interspersed Element-1 (LINE-1) retrotransposition contributes to inter- and intra-individual genetic variation and occasionally can lead to human genetic disorders. Various strategies have been developed to identify human-specific LINE-1 (L1Hs) insertions from short-read whole genome sequencing (WGS) data; however, they have limitations in detecting insertions in complex repetitive genomic regions. Here, we developed a computational tool (PALMER) and used it to identify 203 non-reference L1Hs insertions in the NA12878 benchmark genome. Using PacBio long-read sequencing data, we identified L1Hs insertions that were absent in previous short-read studies (90/203). Approximately 81% (73/90) of the L1Hs insertions reside within endogenous LINE-1 sequences in the reference assembly and the analysis of unique breakpoint junction sequences revealed 63% (57/90) of these L1Hs insertions could be genotyped in 1000 Genomes Project sequences. Moreover, we observed that amplification biases encountered in single-cell WGS experiments led to a wide variation in L1Hs insertion detection rates between four individual NA12878 cells; under-amplification limited detection to 32% (65/203) of insertions, whereas over-amplification increased false positive calls. In sum, these data indicate that L1Hs insertions are often missed using standard short-read sequencing approaches and long-read sequencing approaches can significantly improve the detection of L1Hs insertions present in individual genomes.
Affiliation(s)
- Weichen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA
| | - Sarah B Emery
- Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA
| | - Diane A Flasch
- Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA
| | - Yifan Wang
- Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA
| | - Kenneth Y Kwan
- Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA.,Molecular and Behavioral Neuroscience Institute, University of Michigan Medical School, 109 Zina Pitcher Place, Ann Arbor, MI 48109, USA
| | - Jeffrey M Kidd
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA.,Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA
| | - John V Moran
- Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA.,Department of Internal Medicine, University of Michigan, 1500 East Medical Center Drive, Ann Arbor, MI 48109, USA
| | - Ryan E Mills
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA.,Department of Human Genetics, University of Michigan Medical School, 1241 East Catherine Street, Ann Arbor, MI 48109, USA
| |
|
35
|
Wu Z, Wu Y, Gao J. InvBFM: finding genomic inversions from high-throughput sequence data based on feature mining. BMC Genomics 2020; 21:173. [PMID: 32138660 PMCID: PMC7057458 DOI: 10.1186/s12864-020-6585-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/22/2020] [Accepted: 02/17/2020] [Indexed: 12/03/2022]
Abstract
Background: Genomic inversion is one type of structural variation (SV) and is known to play an important biological role. An established problem in sequence data analysis is calling inversions from high-throughput sequence data. It is more difficult to detect inversions because they are surrounded by duplications or other types of SVs in the inversion areas. Existing inversion detection tools are mainly based on three approaches: paired-end reads, split-mapped reads, and assembly. However, existing tools suffer from unsatisfactory precision or sensitivity (e.g., only 50-60% sensitivity), which needs to be improved. Results: In this paper, we present a new inversion calling method called InvBFM. InvBFM calls inversions based on feature mining. InvBFM first gathers the results of existing inversion detection tools as candidates for inversions. It then extracts features from the inversions. Finally, it calls the true inversions using a trained support vector machine (SVM) classifier. Conclusions: Our results on real sequence data from the 1000 Genomes Project show that by combining feature mining and a machine learning model, InvBFM outperforms existing tools. InvBFM is written in Python and Shell and is available for download at https://github.com/wzj1234/InvBFM.
Affiliation(s)
- Zhongjia Wu
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, People's Republic of China
| | - Yufeng Wu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, USA
| | - Jingyang Gao
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, People's Republic of China.
| |
|
36
|
Tham CY, Tirado-Magallanes R, Goh Y, Fullwood MJ, Koh BTH, Wang W, Ng CH, Chng WJ, Thiery A, Tenen DG, Benoukraf T. NanoVar: accurate characterization of patients' genomic structural variants using low-depth nanopore sequencing. Genome Biol 2020; 21:56. [PMID: 32127024 PMCID: PMC7055087 DOI: 10.1186/s13059-020-01968-7] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 07/03/2019] [Accepted: 02/21/2020] [Indexed: 12/19/2022]
Abstract
The recent advent of third-generation sequencing technologies brings promise for better characterization of genomic structural variants by virtue of having longer reads. However, long-read applications are still constrained by their high sequencing error rates and low sequencing throughput. Here, we present NanoVar, an optimized structural variant caller utilizing low-depth (8X) whole-genome sequencing data generated by Oxford Nanopore Technologies. NanoVar exhibits higher structural variant calling accuracy when benchmarked against current tools using low-depth simulated datasets. In patient samples, we successfully validate structural variants characterized by NanoVar and uncover normal alternative sequences or alleles which are present in healthy individuals.
Affiliation(s)
- Cheng Yong Tham
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore
| | - Roberto Tirado-Magallanes
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore
| | - Yufen Goh
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore
| | - Melissa J Fullwood
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore.,School of Biological Sciences, Nanyang Technological University, Singapore, 637551, Singapore
| | - Bryan T H Koh
- Department of Orthopedic Surgery, National University Health Systems, Singapore, 119228, Singapore
| | - Wilson Wang
- Department of Orthopedic Surgery, National University Health Systems, Singapore, 119228, Singapore.,Department of Orthopaedic Surgery, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119228, Singapore
| | - Chin Hin Ng
- Department of Hematology-Oncology, National University Cancer Institute of Singapore, National University Health System, Singapore, 119228, Singapore
| | - Wee Joo Chng
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore.,Department of Hematology-Oncology, National University Cancer Institute of Singapore, National University Health System, Singapore, 119228, Singapore.,Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119228, Singapore
| | - Alexandre Thiery
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546, Singapore
| | - Daniel G Tenen
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore.,Harvard Stem Cell Institute, Harvard Medical School, Boston, MA, 02115, USA
| | - Touati Benoukraf
- Cancer Science Institute of Singapore, National University of Singapore, Centre for Translational Medicine, 14 Medical Drive, #12-01, Singapore, 117599, Singapore. .,Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John's, NL, A1B 3V6, Canada.
| |
|
37
|
Abstract
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
Affiliation(s)
- Steve S Ho
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA
| | - Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Ryan E Mills
- Department of Human Genetics, University of Michigan, Ann Arbor, MI, USA.
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| |
|
38
|
Loiseau V, Herniou EA, Moreau Y, Lévêque N, Meignin C, Daeffler L, Federici B, Cordaux R, Gilbert C. Wide spectrum and high frequency of genomic structural variation, including transposable elements, in large double-stranded DNA viruses. Virus Evol 2020; 6:vez060. [PMID: 32002191 PMCID: PMC6983493 DOI: 10.1093/ve/vez060] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 02/06/2023]
Abstract
Our knowledge of the diversity and frequency of genomic structural variation segregating in populations of large double-stranded (ds) DNA viruses is limited. Here, we sequenced the genome of a baculovirus (Autographa californica multiple nucleopolyhedrovirus [AcMNPV]) purified from beet armyworm (Spodoptera exigua) larvae at depths >195,000× using both short- (Illumina) and long-read (PacBio) technologies. Using a pipeline relying on hierarchical clustering of structural variants (SVs) detected in individual short and long reads by six variant callers, we identified a total of 1,141 SVs in AcMNPV, including 464 deletions, 443 inversions, 160 duplications, and 74 insertions. These variants are considered robust and unlikely to result from technical artifacts because they were independently detected in at least three long reads as well as at least three short reads. SVs are distributed along the entire AcMNPV genome and may involve large genomic regions (30,496 bp on average). We show that no less than 39.9 per cent of genomes carry at least one SV in AcMNPV populations, that the vast majority of SVs (75%) segregate at very low frequency (<0.01%) and that very few SVs persist after ten replication cycles, consistent with a negative impact of most SVs on AcMNPV fitness. Using short-read sequencing datasets, we then show that populations of two iridoviruses and one herpesvirus are also full of SVs, as they contain between 426 and 1,102 SVs carried by 52.4–80.1 per cent of genomes. Finally, AcMNPV long reads allowed us to identify 1,757 transposable element (TE) insertions, 895 of which are truncated and occur at one extremity of the reads. This further supports the role of baculoviruses as possible vectors of horizontal transfer of TEs. Altogether, we found that SVs, which evolve mostly under rapid dynamics of gain and loss in viral populations, represent an important feature in the biology of large dsDNA viruses.
Affiliation(s)
- Vincent Loiseau
- Laboratoire Evolution, Génomes, Comportement, Écologie, Unité Mixte de Recherche 9191 Centre National de la Recherche Scientifique et Unité Mixte de Recherche 247 Institut de Recherche pour le Développement, Université Paris-Saclay, Gif-sur-Yvette 91198, France
| | - Elisabeth A Herniou
- Institut de Recherche sur la Biologie de l'Insecte, UMR 7261 CNRS - Université de Tours, 37200 Tours, France
| | - Yannis Moreau
- Institut de Recherche sur la Biologie de l'Insecte, UMR 7261 CNRS - Université de Tours, 37200 Tours, France
| | - Nicolas Lévêque
- Laboratoire de Virologie et Mycobactériologie, CHU de Poitiers, 86000 Poitiers, France.,Laboratoire Inflammation, Tissus Epithéliaux et Cytokines, EA 4331, Université de Poitiers, 86000 Poitiers, France
| | - Carine Meignin
- Modèles Insectes d'Immunité Innée (M3i), Université de Strasbourg, IBMC CNRS-UPR9022, Strasbourg F-67000, France
| | - Laurent Daeffler
- Modèles Insectes d'Immunité Innée (M3i), Université de Strasbourg, IBMC CNRS-UPR9022, Strasbourg F-67000, France
| | - Brian Federici
- Department of Entomology and Institute for Integrative Genome Biology, University of California, Riverside, CA 92521, USA
| | - Richard Cordaux
- Laboratoire Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Unité Mixte de Recherche 7267 Centre National de la Recherche Scientifique, Université de Poitiers, 86000 Poitiers, France
| | - Clément Gilbert
- Laboratoire Evolution, Génomes, Comportement, Écologie, Unité Mixte de Recherche 9191 Centre National de la Recherche Scientifique et Unité Mixte de Recherche 247 Institut de Recherche pour le Développement, Université Paris-Saclay, Gif-sur-Yvette 91198, France
| |
|
39
|
Alzaid E, Allali AE. PostSV: A Post-Processing Approach for Filtering Structural Variations. Bioinform Biol Insights 2020; 14:1177932219892957. [PMID: 32009779 PMCID: PMC6974750 DOI: 10.1177/1177932219892957] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/03/2019] [Accepted: 11/09/2019] [Indexed: 11/25/2022]
Abstract
Genomic structural variations are significant causes of genome diversity and complex diseases. With advances in sequencing technologies, many algorithms have been designed to identify structural differences using next-generation sequencing (NGS) data. Due to repetitions in the human genome and the short reads produced by NGS, the discovery of structural variants (SVs) by state-of-the-art SV callers is not always accurate. To improve performance, multiple SV callers are often used to detect variants. However, most SV callers suffer from high false-positive rates, which diminishes the overall performance, especially in low-coverage genomes. In this article, we propose a post-processing classification-based algorithm that can be used to filter structural variation predictions produced by SV callers. Novel features are defined from putative SV predictions using reads at the local regions around the breakpoints. Several classifiers are employed to classify the candidate predictions and remove false positives. We test our classifier models on simulated and real genomes and show that the proposed approach improves the performance of state-of-the-art algorithms.
Affiliation(s)
- Eman Alzaid
- Computer Science Department, King Saud University, Riyadh, Saudi Arabia.,Department of Computer Science, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
| | - Achraf El Allali
- Computer Science Department, King Saud University, Riyadh, Saudi Arabia
| |
|
40
|
Kuzniar A, Maassen J, Verhoeven S, Santuari L, Shneider C, Kloosterman WP, de Ridder J. sv-callers: a highly portable parallel workflow for structural variant detection in whole-genome sequence data. PeerJ 2020; 8:e8214. [PMID: 31934500 PMCID: PMC6951283 DOI: 10.7717/peerj.8214] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/15/2019] [Accepted: 11/14/2019] [Indexed: 12/19/2022]
Abstract
Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases including cancer. Despite the advances in whole genome sequencing, comprehensive and accurate detection of SVs in short-read data still poses some practical and computational challenges. We present sv-callers, a highly portable workflow that enables parallel execution of multiple SV detection tools and provides users with example analyses of detected SV callsets in a Jupyter Notebook. This workflow supports easy deployment of software dependencies, configuration and addition of new analysis tools. Moreover, porting it to different computing systems requires minimal effort. Finally, we demonstrate the utility of the workflow by performing both somatic and germline SV analyses on different high-performance computing systems.
Affiliation(s)
| | | | | | - Luca Santuari
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Carl Shneider
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Wigard P Kloosterman
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| | - Jeroen de Ridder
- Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, Netherlands
| |
|
41
|
Eggertsson HP, Kristmundsdottir S, Beyter D, Jonsson H, Skuladottir A, Hardarson MT, Gudbjartsson DF, Stefansson K, Halldorsson BV, Melsted P. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat Commun 2019; 10:5402. [PMID: 31776332 PMCID: PMC6881350 DOI: 10.1038/s41467-019-13341-9] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 04/23/2019] [Accepted: 10/30/2019] [Indexed: 12/31/2022]
Abstract
Analysis of sequence diversity in the human genome is fundamental for genetic studies. Structural variants (SVs) are frequently omitted in sequence analysis studies, although each has a relatively large impact on the genome. Here, we present GraphTyper2, which uses pangenome graphs to genotype SVs and small variants using short reads. Comparison to the syndip benchmark dataset shows that our SV genotyping is sensitive, and variant segregation in families demonstrates the accuracy of our approach. We demonstrate that incorporating public assembly data into our pipeline greatly improves sensitivity, particularly for large insertions. We validate 6,812 SVs on average per genome using long-read data of 41 Icelanders. We show that GraphTyper2 can simultaneously genotype tens of thousands of whole genomes by characterizing 60 million small variants and half a million SVs in 49,962 Icelanders, including 80 thousand SVs with high confidence.
Affiliation(s)
- Hannes P Eggertsson
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland.
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland.
| | - Snaedis Kristmundsdottir
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland
- School of Science and Engineering, Reykjavik University, Reykjavik, Iceland
| | - Doruk Beyter
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland
| | - Hakon Jonsson
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland
| | | | | | - Daniel F Gudbjartsson
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
| | - Kari Stefansson
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland
- Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
| | - Bjarni V Halldorsson
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland.
- School of Science and Engineering, Reykjavik University, Reykjavik, Iceland.
| | - Pall Melsted
- deCODE genetics/Amgen Inc., Sturlugata 8, Reykjavik, Iceland.
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland.
| |
|
42
|
Zhou A, Lin T, Xing J. Evaluating nanopore sequencing data processing pipelines for structural variation identification. Genome Biol 2019; 20:237. [PMID: 31727126 PMCID: PMC6857234 DOI: 10.1186/s13059-019-1858-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 05/18/2019] [Accepted: 10/10/2019] [Indexed: 11/10/2022]
Abstract
Background: Structural variations (SVs) account for about 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility. The emerging nanopore sequencing technology can generate long sequence reads and can potentially provide accurate SV identification. However, the tools for aligning long-read data and detecting SVs have not been thoroughly evaluated. Results: Using four nanopore datasets, including both empirical and simulated reads, we evaluate four alignment tools and three SV detection tools. We also evaluate the impact of sequencing depth on SV detection. Finally, we develop a machine learning approach to integrate call sets from multiple pipelines. Overall SV callers' performance varies depending on the SV types. For an initial data assessment, we recommend using aligner minimap2 in combination with SV caller Sniffles because of their speed and relatively balanced performance. For detailed analysis, we recommend incorporating information from multiple call sets to improve the SV call performance. Conclusions: We present a workflow for evaluating aligners and SV callers for nanopore sequencing data and approaches for integrating multiple call sets. Our results indicate that additional optimizations are needed to improve SV detection accuracy and sensitivity, and an integrated call set can provide enhanced performance. The nanopore technology is improving, and the sequencing community is likely to grow accordingly. In turn, better benchmark call sets will be available to more accurately assess the performance of available tools and facilitate further tool development.
Affiliation(s)
- Anbo Zhou
- Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, NJ, 08854, USA
| | - Timothy Lin
- Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, NJ, 08854, USA
| | - Jinchuan Xing
- Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, NJ, 08854, USA.
- Human Genetics Institute of New Jersey, Rutgers, the State University of New Jersey, Piscataway, NJ, 08854, USA.
| |
|
43
|
Kómár P, Kural D. geck: trio-based comparative benchmarking of variant calls. Bioinformatics 2019; 34:3488-3495. [PMID: 29850774 PMCID: PMC6184596 DOI: 10.1093/bioinformatics/bty415] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/03/2017] [Accepted: 05/22/2018] [Indexed: 12/30/2022]
Abstract
Motivation: Classical methods of comparing the accuracies of variant calling pipelines are based on truth sets of variants whose genotypes are previously determined with high confidence. An alternative way of performing benchmarking is based on Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide truth set-independent benchmarking information, and enable benchmarking less-studied variants and diverse populations. Results: We introduce a statistical mixture model for comparing two variant calling pipelines from genotype data they produce after running on individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high-confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with 10⁻³ uncertainty. Availability and implementation: The Python library geck, and usage examples are available at the following URL: https://github.com/sbg/geck, under the GNU General Public License v3. Supplementary information: Supplementary data are available at Bioinformatics online.
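The Mendelian constraint mentioned above can be made concrete with a small sketch. This is not the geck mixture model and does not use the geck library; it only shows the basic consistency check that such analyses build on, assuming unphased diploid genotypes at autosomal sites, encoded as allele-index pairs, and ignoring de novo mutations. All function names are hypothetical.

```python
from itertools import product
from typing import Iterable, Tuple

Genotype = Tuple[int, int]  # unphased diploid genotype, e.g. (0, 1)


def is_mendelian_consistent(child: Genotype, mother: Genotype,
                            father: Genotype) -> bool:
    """True if the child genotype can be formed from one maternal
    allele and one paternal allele (order of alleles is ignored)."""
    target = tuple(sorted(child))
    return any(tuple(sorted((m, f))) == target
               for m, f in product(mother, father))


def violation_rate(trios: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of (child, mother, father) genotype triples that
    violate the Mendelian constraint; 0.0 for an empty input."""
    trios = list(trios)
    if not trios:
        return 0.0
    violations = sum(not is_mendelian_consistent(c, m, f)
                     for c, m, f in trios)
    return violations / len(trios)


if __name__ == "__main__":
    calls = [
        ((0, 1), (0, 0), (1, 1)),  # consistent: one allele from each parent
        ((1, 1), (0, 0), (0, 1)),  # violation: mother cannot transmit allele 1
    ]
    print(violation_rate(calls))
```

A raw violation rate like this mixes genotyping errors in all three family members, which is part of why a statistical model is needed to turn violations into per-pipeline accuracy estimates.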
|
44
|
Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun 2019; 10:3240. [PMID: 31324872 PMCID: PMC6642177 DOI: 10.1038/s41467-019-11146-4] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Received: 09/14/2018] [Accepted: 06/26/2019] [Indexed: 01/12/2023]
Abstract
In recent years, many software packages for identifying structural variants (SVs) using whole-genome sequencing data have been released. When published, a new method is commonly compared with those already available, but this tends to be selective and incomplete. The lack of comprehensive benchmarking of methods presents challenges for users in selecting methods and for developers in understanding algorithm behaviours and limitations. Here we report the comprehensive evaluation of 10 SV callers, selected following a rigorous process and spanning the breadth of detection approaches, using high-quality reference cell lines, as well as simulations. Due to the nature of available truth sets, our focus is on general-purpose rather than somatic callers. We characterise the impact on performance of event size and type, sequencing characteristics, and genomic context, and analyse the efficacy of ensemble calling and calibration of variant quality scores. Finally, we provide recommendations for both users and methods developers. A number of computational methods have been developed for calling structural variants (SVs) using short read sequencing data. Here, the authors perform a comprehensive benchmarking analysis comparing 10 general-purpose callers and provide recommendations for both users and methods developers.
|
45
|
Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol 2019; 20:117. [PMID: 31159850 PMCID: PMC6547561 DOI: 10.1186/s13059-019-1720-5] [Citation(s) in RCA: 232] [Impact Index Per Article: 46.4] [Received: 10/15/2018] [Accepted: 05/20/2019] [Indexed: 01/01/2023]
Abstract
BACKGROUND: Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SV with high precision and high recall. RESULTS: We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potentially good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. CONCLUSION: These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.
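The idea of scoring overlapping calls from a pair of algorithms can be sketched as follows. This is an illustration under stated assumptions, not the evaluation procedure used in the study: SVs are represented as half-open (chromosome, start, end) intervals, two calls match when their reciprocal overlap reaches a threshold (0.5 here, a common convention chosen for the example), and all function names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end); 0.0 if disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def matches(call, others, min_ro=0.5):
    """True if `call` reciprocally overlaps any interval in `others` by at least min_ro."""
    return any(reciprocal_overlap(call, other) >= min_ro for other in others)


def overlapping_calls(calls_a, calls_b, min_ro=0.5):
    """Calls from caller A that are supported by at least one call from caller B."""
    return [call for call in calls_a if matches(call, calls_b, min_ro)]


def precision_recall(calls, truth, min_ro=0.5):
    """Precision over the call set and recall over the truth set."""
    precision = sum(matches(c, truth, min_ro) for c in calls) / len(calls) if calls else 0.0
    recall = sum(matches(t, calls, min_ro) for t in truth) / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical deletion calls and truth set.
    truth = [("chr1", 100, 200), ("chr1", 500, 700), ("chr2", 50, 150)]
    caller_a = [("chr1", 110, 210), ("chr1", 480, 690), ("chr1", 1000, 1100)]
    caller_b = [("chr1", 105, 205), ("chr2", 60, 160), ("chr1", 1000, 1090)]
    shared = overlapping_calls(caller_a, caller_b)
    print(precision_recall(caller_a, truth))
    print(precision_recall(shared, truth))
```

In this toy data the two callers share a false positive, so intersecting them does not raise precision. That is one way to see why, as the abstract reports, the benefit of overlapping calls depends on which specific algorithms are paired.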
Affiliation(s)
- Shunichi Kosugi
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yukihide Momozawa
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Xiaoxi Liu
- Laboratory for Genotyping Development, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Chikashi Terao
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Michiaki Kubo
- RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| | - Yoichiro Kamatani
- Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
- Laboratory for Statistical and Translational Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, 230-0045 Japan
| |
|
46
|
Zhang L, Bai W, Yuan N, Du Z. Comprehensively benchmarking applications for detecting copy number variation. PLoS Comput Biol 2019; 15:e1007069. [PMID: 31136576 PMCID: PMC6555534 DOI: 10.1371/journal.pcbi.1007069] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 01/23/2019] [Revised: 06/07/2019] [Accepted: 05/06/2019] [Indexed: 12/15/2022]
Abstract
Motivation: Recently, copy number variation (CNV) has gained considerable interest as a type of genomic variation that plays an important role in complex phenotypes and disease susceptibility. Since a number of CNV detection methods have recently been developed, it is necessary to help investigators choose suitable methods for CNV detection depending on their objectives. For this reason, this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands. Taking the DGV gold standard variants as a standard dataset, we evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X. Among the ten methods benchmarked, LUMPY performs the best for both high sensitivity and specificity at each sequencing depth. For the purpose of high specificity, Canvas is also a good choice. If high sensitivity is preferred, CNVnator and RDXplorer are better choices. Additionally, CNVnator and GROM-RD perform well for low-depth sequencing data. Our results provide a comprehensive performance evaluation for these selected CNV detection methods and facilitate future development and improvement in CNV prediction methods. As an important type of genomic structural variation, CNVs are associated with complex phenotypes because they change the number of copies of genes in cells, affecting coding sequences and playing an important role in the susceptibility or resistance to human diseases. To identify CNVs, several experimental methods have been developed, but their resolution is very low, and the detection of short CNVs presents a bottleneck. In recent years, the advancement of high-throughput sequencing techniques has made it possible to precisely detect CNVs, especially short ones. Many CNV detection applications were developed based on the availability of high-throughput sequencing data. 
Due to different CNV detection algorithms, the CNVs identified by different applications vary greatly. Therefore, it is necessary to help investigators choose suitable applications for CNV detection depending upon their objectives. For this reason, we not only compared ten commonly used CNV detection applications but also benchmarked the applications by sensitivity, specificity and computational demands. Our results show that the sequencing depth can strongly affect CNV detection. Among the ten applications benchmarked, LUMPY performs best for both high sensitivity and specificity for each sequencing depth. We also give recommended applications for specific purposes, for example, CNVnator and RDXplorer for high sensitivity and CNVnator and GROM-RD for low-depth sequencing data.
Affiliation(s)
- Le Zhang
- College of Computer Science, Sichuan University, Chengdu, China
- Medical Big Data Center, Sichuan University, Chengdu, China
- Zdmedical, Information polytron Technologies Inc. Chongqing, Chongqing, China
| | - Wanyu Bai
- College of Computer Science, Sichuan University, Chengdu, China
| | - Na Yuan
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
| | - Zhenglin Du
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, PR China
| |
|
47
|
Bowden R, Davies RW, Heger A, Pagnamenta AT, de Cesare M, Oikkonen LE, Parkes D, Freeman C, Dhalla F, Patel SY, Popitsch N, Ip CLC, Roberts HE, Salatino S, Lockstone H, Lunter G, Taylor JC, Buck D, Simpson MA, Donnelly P. Sequencing of human genomes with nanopore technology. Nat Commun 2019; 10:1869. [PMID: 31015479 PMCID: PMC6478738 DOI: 10.1038/s41467-019-09637-5] [Citation(s) in RCA: 102] [Impact Index Per Article: 20.4] [Received: 08/30/2018] [Accepted: 03/19/2019] [Indexed: 12/17/2022]
Abstract
Whole-genome sequencing (WGS) is becoming widely used in clinical medicine in diagnostic contexts and to inform treatment choice. Here we evaluate the potential of the Oxford Nanopore Technologies (ONT) MinION long-read sequencer for routine WGS by sequencing the reference sample NA12878 and the genome of an individual with ataxia-pancytopenia syndrome and severe immune dysregulation. We develop and apply a novel reference panel-free analytical method to infer and then exploit phase information, which improves single-nucleotide variant (SNV) calling performance from otherwise modest levels. In the clinical sample, we identify and directly phase two non-synonymous de novo variants in SAMD9L (OMIM #159550), inferring that they lie on the same paternal haplotype. Whilst consensus SNV-calling error rates from ONT data remain substantially higher than those from short-read methods, we demonstrate the substantial benefits of analytical innovation. Ongoing improvements to base-calling and SNV-calling methodology must continue for nanopore sequencing to establish itself as a primary method for clinical WGS.
Affiliation(s)
- Rory Bowden
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Robert W Davies
- Genomics plc, Oxford, OX1 1JD, UK
- Program in Genetics and Genomic Biology and The Centre for Applied Genomics, Hospital for Sick Children, Toronto, M5G 0A4, Canada
| | - Alistair T Pagnamenta
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- National Institute for Health Research Oxford Biomedical Research Centre, Oxford, OX4 2PG, UK
| | - Laura E Oikkonen
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Duncan Parkes
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Colin Freeman
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Fatima Dhalla
- Department of Clinical Immunology, Oxford University Hospitals, Oxford, OX3 9DU, UK
- Developmental Immunology Group, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK
| | - Smita Y Patel
- Department of Clinical Immunology, Oxford University Hospitals, Oxford, OX3 9DU, UK
- Clinical Immunology Group, National Institute for Health Research Oxford Biomedical Research Centre, Oxford, OX4 2PG, UK
| | - Niko Popitsch
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- National Institute for Health Research Oxford Biomedical Research Centre, Oxford, OX4 2PG, UK
- Children's Cancer Research Institute, St. Anna Kinderkrebsforschung, 1090, Vienna, Austria
| | - Camilla L C Ip
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Hannah E Roberts
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Silvia Salatino
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Helen Lockstone
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Gerton Lunter
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- Genomics plc, Oxford, OX1 1JD, UK
| | - Jenny C Taylor
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- National Institute for Health Research Oxford Biomedical Research Centre, Oxford, OX4 2PG, UK
| | - David Buck
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
| | - Peter Donnelly
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- Genomics plc, Oxford, OX1 1JD, UK
- Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
| |
|
48
|
Marks P, Garcia S, Barrio AM, Belhocine K, Bernate J, Bharadwaj R, Bjornson K, Catalanotti C, Delaney J, Fehr A, Fiddes IT, Galvin B, Heaton H, Herschleb J, Hindson C, Holt E, Jabara CB, Jett S, Keivanfar N, Kyriazopoulou-Panagiotopoulou S, Lek M, Lin B, Lowe A, Mahamdallie S, Maheshwari S, Makarewicz T, Marshall J, Meschi F, O'Keefe CJ, Ordonez H, Patel P, Price A, Royall A, Ruark E, Seal S, Schnall-Levin M, Shah P, Stafford D, Williams S, Wu I, Xu AW, Rahman N, MacArthur D, Church DM. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res 2019; 29:635-645. [PMID: 30894395 PMCID: PMC6442396 DOI: 10.1101/gr.234443.118] [Citation(s) in RCA: 123] [Impact Index Per Article: 24.6] [Received: 01/09/2018] [Accepted: 02/21/2019] [Indexed: 02/07/2023]
Abstract
Large-scale population analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, these short-read approaches fail to give a complete picture of a genome. They struggle to identify structural events, cannot access repetitive regions, and fail to resolve the human genome into haplotypes. Here, we describe an approach that retains long range information while maintaining the advantages of short reads. Starting from ∼1 ng of high molecular weight DNA, we produce barcoded short-read libraries. Novel informatic approaches allow for the barcoded short reads to be associated with their original long molecules, producing a novel data type known as "Linked-Reads". This approach allows for simultaneous detection of small and large variants from a single library. In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. Linked-Reads allow mapping to 38 Mb of sequence not accessible to short reads, adding sequence in 423 difficult-to-sequence genes including disease-relevant genes STRC, SMN1, and SMN2. Both Linked-Read whole-genome and whole-exome sequencing identify complex structural variations, including balanced events and single exon deletions and duplications. Further, Linked-Reads extend the region of high-confidence calls by 68.9 Mb. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
Affiliation(s)
| | - Adrian Fehr
- 10x Genomics, Pleasanton, California 94566, USA
| | - Esty Holt
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Monkol Lek
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Bill Lin
- 10x Genomics, Pleasanton, California 94566, USA
| | - Adam Lowe
- 10x Genomics, Pleasanton, California 94566, USA
| | - Shazia Mahamdallie
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Jamie Marshall
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Elise Ruark
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Sheila Seal
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Preyas Shah
- 10x Genomics, Pleasanton, California 94566, USA
| | - Indira Wu
- 10x Genomics, Pleasanton, California 94566, USA
| | - Nazneen Rahman
- The Institute of Cancer Research, Division of Genetics and Epidemiology, London SM2 5NG, United Kingdom
| | - Daniel MacArthur
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| |
|
49
|
Zhou B, Arthur JG, Ho SS, Pattni R, Huang Y, Wong WH, Urban AE. Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Sci Data 2018; 5:180261. [PMID: 30561434 PMCID: PMC6298255 DOI: 10.1038/sdata.2018.261] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/23/2018] [Accepted: 10/04/2018] [Indexed: 12/30/2022]
Abstract
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively parallel DNA sequencing platform. The original Venter genome sequence is a very high-quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200 bp and 350 bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverage, respectively) as well as 2 kb, 5 kb, and 12 kb mate-pair libraries (49x, 122x, and 145x physical coverage, respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
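The abstract distinguishes genomic (sequence) coverage from physical coverage. As a rough sketch using the usual textbook definitions rather than the authors' exact calculation, sequence coverage counts sequenced bases per genome position, while physical coverage counts how many fragments span a position, including the unsequenced stretch between paired reads. The function names and all numbers below are hypothetical.

```python
def sequence_coverage(n_reads, read_length, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return n_reads * read_length / genome_size


def physical_coverage(n_pairs, insert_size, genome_size):
    """Average number of read-pair fragments spanning each genome position."""
    return n_pairs * insert_size / genome_size


if __name__ == "__main__":
    # Hypothetical mate-pair library on a 100 Mb genome:
    # 1 million pairs of 100 bp reads from 2 kb fragments.
    genome_size = 100_000_000
    print(sequence_coverage(2_000_000, 100, genome_size))
    print(physical_coverage(1_000_000, 2_000, genome_size))
```

Because the fragment is much longer than the two reads taken from its ends, a mate-pair or linked-read library can reach a physical coverage far above its sequence coverage, which is why the two are reported separately.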
Affiliation(s)
- Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Joseph G. Arthur
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
| | - Steve S. Ho
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Reenal Pattni
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Yiling Huang
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
| | - Wing H. Wong
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, California 94305, USA
| | - Alexander E. Urban
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Tashia and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Palo Alto, California 94305, USA
| |
|
50
|
A Randomized Iterative Approach for SV Discovery with SVelter. Methods Mol Biol 2018. [PMID: 30039372 DOI: 10.1007/978-1-4939-8666-8_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Genomic structural variants (SVs) are major sources of genome diversity, and numerous studies over the past few decades have shown the impact this class of genetic variation has had on human health and disease. In spite of recent advances in sequencing technology and discovery methodology, there is still a considerable number of variants in the genome that are partially or completely misinterpreted. The computational tool introduced in this chapter, SVelter, is specifically designed to detect and resolve genomic SVs in all different formats, including the canonical as well as the complex.
|