1. Chi WY, Hu Y, Huang HC, Kuo HH, Lin SH, Kuo CTJ, Tao J, Fan D, Huang YM, Wu AA, Hung CF, Wu TC. Molecular targets and strategies in the development of nucleic acid cancer vaccines: from shared to personalized antigens. J Biomed Sci 2024; 31:94. [PMID: 39379923 PMCID: PMC11463125 DOI: 10.1186/s12929-024-01082-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Accepted: 09/01/2024] [Indexed: 10/10/2024] Open
Abstract
Recent breakthroughs in cancer immunotherapies have emphasized the importance of harnessing the immune system for treating cancer. Vaccines, which have traditionally been used to promote protective immunity against pathogens, are now being explored as a method to target cancer neoantigens. Over the past few years, extensive preclinical research and more than a hundred clinical trials have been dedicated to investigating various approaches to neoantigen discovery and vaccine formulations, encouraging the development of personalized medicine. Nucleic acids (DNA and mRNA) have become a particularly promising platform for the development of these cancer immunotherapies. This shift towards nucleic acid-based personalized vaccines has been facilitated by advancements in molecular techniques for identifying neoantigens, antigen prediction methodologies, and the development of new vaccine platforms. Generating these personalized vaccines involves a comprehensive pipeline that includes sequencing of patient tumor samples, data analysis for antigen prediction, and tailored vaccine manufacturing. In this review, we will discuss the various shared and personalized antigens used for cancer vaccine development and introduce strategies for identifying neoantigens through the characterization of gene mutation, transcription, translation and post-translational modifications associated with oncogenesis. In addition, we will focus on the most up-to-date nucleic acid vaccine platforms, discuss the limitations of cancer vaccines and provide potential solutions, and raise key clinical and technical considerations in vaccine development.
Affiliation(s)
- Wei-Yu Chi
  - Physiology, Biophysics and Systems Biology Graduate Program, Weill Cornell Medicine, New York, NY, USA
- Yingying Hu
  - Tri-Institutional PhD Program in Chemical Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Hsin-Che Huang
  - Tri-Institutional PhD Program in Chemical Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Hui-Hsuan Kuo
  - Pharmacology PhD Program, Weill Cornell Medicine, New York, NY, USA
- Shu-Hong Lin
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - The University of Texas Graduate School of Biomedical Sciences at Houston and MD Anderson Cancer Center, Houston, TX, USA
- Chun-Tien Jimmy Kuo
  - Division of Pharmaceutics and Pharmacology, College of Pharmacy, The Ohio State University, Columbus, OH, USA
- Julia Tao
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
- Darrell Fan
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
- Yi-Min Huang
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
- Annie A Wu
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
- Chien-Fu Hung
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
  - Department of Oncology, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Obstetrics and Gynecology, Johns Hopkins School of Medicine, Baltimore, MD, USA
- T-C Wu
  - Department of Pathology, Johns Hopkins School of Medicine, 1550 Orleans St, CRB II Room 309, Baltimore, MD, 21287, USA
  - Department of Oncology, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Obstetrics and Gynecology, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Molecular Microbiology and Immunology, Bloomberg School of Public Health, Johns Hopkins School of Medicine, Baltimore, MD, USA
2. Grant JR, Herman EK, Barlow LD, Miglior F, Schenkel FS, Baes CF, Stothard P. A large structural variant collection in Holstein cattle and associated database for variant discovery, characterization, and application. BMC Genomics 2024; 25:903. [PMID: 39350025 PMCID: PMC11440700 DOI: 10.1186/s12864-024-10812-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Accepted: 09/19/2024] [Indexed: 10/04/2024] Open
Abstract
BACKGROUND: Structural variants (SVs) such as deletions, duplications, and insertions are known to contribute to phenotypic variation but remain challenging to identify and genotype. A more complete, accessible, and assessable collection of SVs will assist efforts to study SV function in cattle and to incorporate SV genotyping into animal evaluation.
RESULTS: In this work we produced a large and deeply characterized collection of SVs in Holstein cattle using two popular SV callers (Manta and Smoove) and publicly available Illumina whole-genome sequence (WGS) read sets from 310 samples (290 male, 20 female, mean 20X coverage). Manta and Smoove identified 31 K and 68 K SVs, respectively. In total the SVs cover 5% (Manta) and 6% (Smoove) of the reference genome, in contrast to the 1% impacted by SNPs and indels. SV genotypes from each caller were confirmed to accurately recapitulate animal relationships estimated using WGS SNP genotypes from the same dataset, with Manta genotypes outperforming Smoove, and deletions outperforming duplications. To support efforts to link the SVs to phenotypic variation, overlapping and tag SNPs were identified for each SV, using genotype sets extracted from the WGS results corresponding to two bovine SNP chips (BovineSNP50 and BovineHD). 9% (Manta) and 11% (Smoove) of the SVs were found to have overlapping BovineHD panel SNPs, while 21% (Manta) and 9% (Smoove) have BovineHD panel tag SNPs. A custom interactive database (https://svdb-dc.pslab.ca) containing the identified sequence variants with extensive annotations, gene feature information, and BAM file content for all SVs was created to enable the evaluation and prioritization of SVs for further study. Illustrative examples involving the genes POPDC3, ORM1, G2E3, FANCI, TFB1M, FOXC2, N4BP2, GSTA3, and COPA show how this resource can be used to find well-supported genic SVs, determine SV breakpoints, design genotyping approaches, and identify processed pseudogenes masquerading as deletions.
CONCLUSIONS: The resources developed through this study can be used to explore sequence variation in Holstein cattle and to develop strategies for studying SVs of interest. The lack of overlapping and tag SNPs from commonly used SNP chips for most of the SVs suggests that other genotyping approaches will be needed (for example direct genotyping) to understand their potential contributions to phenotype. The included SV genotype assessments point to challenges in characterizing SVs, especially duplications, using short-read data and support ongoing efforts to better characterize cattle genomes through long-read sequencing. Lastly, the identification of previously known functional SVs and additional CDS-overlapping SVs supports the phenotypic relevance of this dataset.
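The tag-SNP idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) for the same animals, measures linkage with the squared Pearson correlation (r²), and calls a SNP a tag when r² reaches a threshold. The 0.8 threshold, the function names, and the toy genotypes are assumptions for illustration; the abstract does not state the criteria the authors used.

```python
def genotype_r2(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic site carries no linkage information
    return cov * cov / (var_a * var_b)


def find_tag_snps(sv_genotypes, snp_genotypes, min_r2=0.8):
    """Return (snp_id, r2) pairs whose r2 with the SV meets min_r2 (assumed threshold)."""
    tags = []
    for snp_id, genotypes in snp_genotypes.items():
        r2 = genotype_r2(sv_genotypes, genotypes)
        if r2 >= min_r2:
            tags.append((snp_id, r2))
    return sorted(tags, key=lambda pair: pair[1], reverse=True)


# Toy data: one SV and three hypothetical chip SNPs genotyped in six animals.
sv = [0, 1, 2, 0, 1, 2]
snps = {
    "snp_a": [0, 1, 2, 0, 1, 2],  # same pattern as the SV
    "snp_b": [2, 1, 0, 2, 1, 0],  # opposite allele coding, still fully informative
    "snp_c": [0, 0, 1, 1, 2, 2],  # weakly correlated
}
print(find_tag_snps(sv, snps))
```

Because r² ignores the sign of the correlation, a SNP whose alleles are coded in the opposite orientation still qualifies as a tag, which is usually the desired behavior.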
Affiliation(s)
- Jason R Grant
  - Agricultural, Food & Nutritional Science, University of Alberta, Edmonton, AB, T6G 2P5, Canada
- Emily K Herman
  - Agricultural, Food & Nutritional Science, University of Alberta, Edmonton, AB, T6G 2P5, Canada
- Lael D Barlow
  - Agricultural, Food & Nutritional Science, University of Alberta, Edmonton, AB, T6G 2P5, Canada
- Filippo Miglior
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
  - Lactanet, Guelph, ON, Canada
- Flavio S Schenkel
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- Christine F Baes
  - Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
  - Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern, Switzerland
- Paul Stothard
  - Agricultural, Food & Nutritional Science, University of Alberta, Edmonton, AB, T6G 2P5, Canada
3. Record CJ, Pipis M, Skorupinska M, Blake J, Poh R, Polke JM, Eggleton K, Nanji T, Zuchner S, Cortese A, Houlden H, Rossor AM, Laura M, Reilly MM. Whole genome sequencing increases the diagnostic rate in Charcot-Marie-Tooth disease. Brain 2024; 147:3144-3156. [PMID: 38481354 PMCID: PMC11370804 DOI: 10.1093/brain/awae064] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/09/2023] [Revised: 01/17/2024] [Accepted: 02/07/2024] [Indexed: 09/04/2024] Open
Abstract
Charcot-Marie-Tooth disease (CMT) is one of the most common and genetically heterogeneous inherited neurological diseases, with more than 130 disease-causing genes. Whole genome sequencing (WGS) has improved diagnosis across genetic diseases, but the diagnostic impact in CMT is yet to be fully reported. We present the diagnostic results from a single specialist inherited neuropathy centre, including the impact of WGS diagnostic testing. Patients were assessed at our specialist inherited neuropathy centre from 2009 to 2023. Genetic testing was performed using single gene testing, next-generation sequencing targeted panels, research whole exome sequencing and WGS and, latterly, WGS through the UK National Health Service. Variants were assessed using the American College of Medical Genetics and Genomics and Association for Clinical Genomic Science criteria. Excluding patients with hereditary ATTR amyloidosis, 1515 patients with a clinical diagnosis of CMT and related disorders were recruited. In summary, 621 patients had CMT1 (41.0%), 294 CMT2 (19.4%), 205 intermediate CMT (CMTi, 13.5%), 139 hereditary motor neuropathy (HMN, 9.2%), 93 hereditary sensory neuropathy (HSN, 6.1%), 38 sensory ataxic neuropathy (2.5%), 72 hereditary neuropathy with liability to pressure palsies (HNPP, 4.8%) and 53 'complex' neuropathy (3.5%). Overall, a genetic diagnosis was reached in 76.9% (1165/1515). A diagnosis was most likely in CMT1 (96.8%, 601/621), followed by CMTi (81.0%, 166/205) and then HSN (69.9%, 65/93). Diagnostic rates remained less than 50% in CMT2, HMN and complex neuropathies. The most common genetic diagnosis was PMP22 duplication (CMT1A; 505/1165, 43.3%), then GJB1 (CMTX1; 151/1165, 13.0%), PMP22 deletion (HNPP; 72/1165, 6.2%) and MFN2 (CMT2A; 46/1165, 3.9%). 
We recruited 233 cases to the UK 100 000 Genomes Project (100KGP), of which 74 (31.8%) achieved a diagnosis; 28 had been otherwise diagnosed since recruitment, leaving a true diagnostic rate of WGS through the 100KGP of 19.7% (46/233). However, almost half of the solved cases (35/74) received a negative report from the study, and the diagnosis was made through our research access to the WGS data. The overall diagnostic uplift of WGS for the entire cohort was 3.5%. Our diagnostic rate is the highest reported from a single centre and has benefitted from the use of WGS, particularly access to the raw data. However, almost one-quarter of all cases remain unsolved, and a new reference genome and novel technologies will be important to narrow the 'diagnostic gap'.
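The headline percentages in this abstract follow directly from the reported counts, and a short arithmetic check makes the relationships explicit. The sketch below uses only numbers stated in the abstract; the variable names and the `pct` helper are illustrative assumptions. It does not attempt to reproduce the 3.5% overall uplift, because the abstract does not give the counts behind that figure.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * numerator / denominator, 1)


cohort = 1515   # patients with CMT and related disorders
solved = 1165   # patients with a genetic diagnosis

# Clinical subtypes as listed in the abstract.
subtypes = {
    "CMT1": 621, "CMT2": 294, "CMTi": 205, "HMN": 139,
    "HSN": 93, "sensory ataxic": 38, "HNPP": 72, "complex": 53,
}
assert sum(subtypes.values()) == cohort  # the subtypes partition the cohort

overall_rate = pct(solved, cohort)             # overall diagnostic rate
unsolved_rate = pct(cohort - solved, cohort)   # "almost one-quarter" unsolved

# 100KGP subset: 74 diagnosed, of whom 28 were diagnosed by other routes.
recruited_100kgp = 233
diagnosed_100kgp = 74
diagnosed_otherwise = 28
wgs_only = diagnosed_100kgp - diagnosed_otherwise
raw_100kgp_rate = pct(diagnosed_100kgp, recruited_100kgp)
true_100kgp_rate = pct(wgs_only, recruited_100kgp)

print(overall_rate, unsolved_rate, raw_100kgp_rate, true_100kgp_rate)
```

The "true" 100KGP rate is lower than the raw rate because the 28 patients diagnosed by other routes are subtracted from the numerator while the denominator stays at 233.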
Affiliation(s)
- Christopher J Record
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Menelaos Pipis
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Mariola Skorupinska
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Julian Blake
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
  - Department of Clinical Neurophysiology, Norfolk and Norwich University Hospital, Norwich NR4 7UY, UK
- Roy Poh
  - Neurogenetics Laboratory, National Hospital for Neurology and Neurosurgery, London WC1N 3BG, UK
- James M Polke
  - Neurogenetics Laboratory, National Hospital for Neurology and Neurosurgery, London WC1N 3BG, UK
- Kelly Eggleton
  - Neurogenetics Laboratory, National Hospital for Neurology and Neurosurgery, London WC1N 3BG, UK
- Tina Nanji
  - Neurogenetics Laboratory, National Hospital for Neurology and Neurosurgery, London WC1N 3BG, UK
- Stephan Zuchner
  - Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, FL 33136, USA
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL 33136, USA
- Andrea Cortese
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Henry Houlden
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Alexander M Rossor
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Matilde Laura
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
- Mary M Reilly
  - Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London WC1N 3BG, UK
4. Ormond C, Ryan NM, Byerley W, Heron EA, Corvin A. Investigating copy number variants in schizophrenia pedigrees using a new consensus pipeline called PECAN. Sci Rep 2024; 14:17518. [PMID: 39080331 PMCID: PMC11289470 DOI: 10.1038/s41598-024-66021-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Accepted: 06/26/2024] [Indexed: 08/02/2024] Open
Abstract
Copy number variants (CNVs) have been implicated in many human diseases, including psychiatric disorders. Whole genome sequencing offers advantages in CNV calling compared to previous array-based methods. Here we present a robust and transparent CNV calling pipeline, PECAN (PEdigree Copy number vAriaNt calling), for short-read, whole genome sequencing data, comprised of a novel combination of four calling methods and structural variant genotyping. This method is scalable and can incorporate pedigree information to retain lower-confidence CNVs that would otherwise be discarded. We have robustly benchmarked PECAN using gold-standard CNV calls for two well-established evaluation samples, NA12878 and HG002, showing that PECAN performs with high precision and recall on both datasets, outperforming another pedigree-based CNV calling pipeline. As part of this work, we provide a list of high-confidence gold standard CNVs for the NA12878 reference sample, curated from multiple studies. We applied PECAN to a collection of pedigrees multiply affected with schizophrenia and identified a rare deletion that perfectly co-segregates with schizophrenia in one of the pedigrees. The CNV overlaps the gene PITRM1, which has been implicated in a complex phenotype including ataxia, developmental delay, and schizophrenia-like episodes in affected adults.
Affiliation(s)
- Cathal Ormond
  - Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- Niamh M Ryan
  - Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- William Byerley
  - Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, CA, USA
- Elizabeth A Heron
  - Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
- Aiden Corvin
  - Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity Centre for Health Sciences, Trinity College Dublin, James' Street, Dublin 8, Ireland
5. Sarwal V, Lee S, Yang J, Sankararaman S, Chaisson M, Eskin E, Mangul S. VISTA: an integrated framework for structural variant discovery. Brief Bioinform 2024; 25:bbae462. [PMID: 39297879 PMCID: PMC11411772 DOI: 10.1093/bib/bbae462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 08/27/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] Open
Abstract
Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome. Still, this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn's disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Since identifying structural variants is an important problem in genetics, several specialized computational techniques have been developed to detect structural variants directly from sequencing data. With advances in whole-genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect a full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools which ignore the length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, along with an in-house polymerase chain reaction (PCR)-validated mouse gold standard set. 
VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized can attain 100% precision and the highest sensitivity among other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.
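Precision, recall, and F1 against a gold-standard SV set, the metrics this abstract reports, can be computed with a small matching routine. The sketch below is a generic illustration, not VISTA's evaluation code: the `SV` record, the 100 bp breakpoint tolerance, and the greedy one-to-one matching rule are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INS", "INV"


def matches(call, truth, tolerance=100):
    """A call matches a truth SV if type and chromosome agree and both
    breakpoints lie within `tolerance` bp (an assumed criterion)."""
    return (
        call.chrom == truth.chrom
        and call.svtype == truth.svtype
        and abs(call.start - truth.start) <= tolerance
        and abs(call.end - truth.end) <= tolerance
    )


def benchmark(calls, truth_set, tolerance=100):
    """Return (precision, recall, f1); each truth SV may be matched at most once."""
    matched_truth = set()
    true_positives = 0
    for call in calls:
        for i, truth in enumerate(truth_set):
            if i not in matched_truth and matches(call, truth, tolerance):
                matched_truth.add(i)
                true_positives += 1
                break
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: four gold-standard SVs, three calls, two of which match.
truth = [
    SV("chr1", 10_000, 12_000, "DEL"),
    SV("chr1", 50_000, 50_800, "DUP"),
    SV("chr2", 7_000, 7_300, "DEL"),
    SV("chr3", 90_000, 95_000, "INV"),
]
calls = [
    SV("chr1", 10_020, 11_990, "DEL"),  # within tolerance of the first truth SV
    SV("chr2", 7_050, 7_310, "DEL"),    # within tolerance of the third truth SV
    SV("chr4", 1_000, 2_000, "DEL"),    # no counterpart in the truth set
]
print(benchmark(calls, truth))
```

F1 is the harmonic mean of precision and recall, so a caller tuned for 100% precision still scores poorly on F1 if its recall is low; this is why the abstract reports an optimized mode separately from the F1 comparison.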
Affiliation(s)
- Varuni Sarwal
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Seungmo Lee
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Jianzhi Yang
  - Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Sriram Sankararaman
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Mark Chaisson
  - Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, United States
- Serghei Mangul
  - Department of Quantitative and Computational Biology, Dana and David Dornsife College of Letters, Arts and Sciences University of Southern California, 3540 S Figueroa St, Los Angeles, California 90089, United States
  - Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, United States
6. Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges. Brief Bioinform 2024; 25:bbad508. [PMID: 38271481 PMCID: PMC10810331 DOI: 10.1093/bib/bbad508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/18/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Affiliation(s)
- Yury A Barbitoff
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
- Mikhail O Ushakov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Tatyana E Lazareva
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Yulia A Nasykhova
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Andrey S Glotov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Alexander V Predeus
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
7. Gaitán N, Duitama J. A graph clustering algorithm for detection and genotyping of structural variants from long reads. Gigascience 2024; 13:giad112. [PMID: 38206589 PMCID: PMC10783151 DOI: 10.1093/gigascience/giad112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Revised: 08/02/2023] [Accepted: 12/08/2023] [Indexed: 01/12/2024] Open
Abstract
BACKGROUND: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed.
FINDINGS: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates the integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulation and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths.
CONCLUSION: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
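The clustering step described in this abstract can be sketched with a compact, pure-Python DBSCAN over SV signatures. This is an illustration of the general technique, not the authors' implementation: each signature is assumed to be a raw `(position, length)` pair in base pairs, and the `eps` and `min_pts` values, the unscaled coordinates, and the median-based summary are assumptions; the abstract does not say how the coordinates are calculated.

```python
from math import hypot
from statistics import median

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within Euclidean distance eps of points[idx]."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points) if hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def summarize(points, labels):
    """Collapse each cluster into one candidate SV: (median position, median length, support)."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != NOISE:
            clusters.setdefault(label, []).append(point)
    return {
        label: (median(p for p, _ in members), median(l for _, l in members), len(members))
        for label, members in clusters.items()
    }


# Toy signatures as (genomic position, SV length): two supported events and one stray read.
signatures = [
    (1000, 300), (1005, 310), (998, 295),
    (5000, 1200), (5010, 1190), (5004, 1210),
    (9000, 60),
]
labels = dbscan(signatures, eps=50, min_pts=3)
print(labels)
print(summarize(signatures, labels))
```

Clustering on position and length together keeps two events that start near each other but differ greatly in size from being merged, and the `min_pts` requirement discards signatures supported by too few reads.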
Affiliation(s)
- Nicolás Gaitán
  - Systems and Computing Engineering Department, Universidad de Los Andes, Bogotá 111711, Colombia
- Jorge Duitama
  - Systems and Computing Engineering Department, Universidad de Los Andes, Bogotá 111711, Colombia
8. Lu N, Qiao Y, An P, Luo J, Bi C, Li M, Lu Z, Tu J. Exploration of whole genome amplification generated chimeric sequences in long-read sequencing data. Brief Bioinform 2023; 24:bbad275. [PMID: 37529913 DOI: 10.1093/bib/bbad275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 06/21/2023] [Accepted: 07/10/2023] [Indexed: 08/03/2023] Open
Abstract
MOTIVATION: Multiple displacement amplification (MDA) has become the most commonly used method of whole genome amplification, generating a vast amount of DNA with higher molecular weight and greater genome coverage. Coupled with long-read sequencing, it makes it possible to sequence amplicons of over 20 kb in length. However, the formation of chimeric sequences (chimeras, expressed as structural errors in sequencing data) in MDA seriously interferes with bioinformatics analysis, and its influence on long-read sequencing data is unknown.
RESULTS: We sequenced the phi29 DNA polymerase-mediated MDA amplicons on the PacBio platform and analyzed chimeras within the generated data. The 3rd-ChimeraMiner has been constructed as a pipeline for recognizing chimeras and restoring them to their original structures in long-read sequencing data, improving the efficiency of using third-generation sequencing (TGS) data. Five long-read datasets and one high-fidelity long-read dataset with various amplification folds were analyzed. The results reveal that mis-priming events in amplification occur more frequently than widely perceived, and their proportion gradually accumulates from 42% to over 78% as the amplification continues. In total, 99.92% of recognized chimeric sequences were demonstrated to be artifacts, whose structures were wrongly formed in MDA instead of existing in the original genomes. By restoring chimeras to their original structures, the vast majority of supplementary alignments that introduce false-positive structural variants are recycled, removing 97% of inversions on average and contributing to the analysis of structural variation in MDA-amplified samples. The impact of chimeras in long-read sequencing data analysis should be emphasized, and the 3rd-ChimeraMiner can help to quantify and reduce the influence of chimeras.
AVAILABILITY AND IMPLEMENTATION: The 3rd-ChimeraMiner is available on GitHub, https://github.com/dulunar/3rdChimeraMiner.
Affiliation(s)
- Na Lu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Yi Qiao
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Pengfei An
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
  - Monash University-Southeast University Joint Research Institute, Suzhou 215123, China
- Jiajian Luo
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Changwei Bi
  - College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
- Musheng Li
  - Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89511, USA
- Zuhong Lu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Jing Tu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
9. Divakar MK, Jain A, Bhoyar RC, Senthivel V, Jolly B, Imran M, Sharma D, Bajaj A, Gupta V, Scaria V, Sivasubbu S. Whole-genome sequencing of 1029 Indian individuals reveals unique and rare structural variants. J Hum Genet 2023; 68:409-417. [PMID: 36813834 DOI: 10.1038/s10038-023-01131-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 01/31/2023] [Accepted: 02/06/2023] [Indexed: 02/24/2023]
Abstract
Structural variants contribute to genetic variability in human genomes and can occur in population-specific patterns. We aimed to understand the landscape of structural variants in the genomes of healthy Indian individuals and explore their potential implications in genetic disease conditions. For the identification of structural variants, a whole genome sequencing dataset of 1029 self-declared healthy Indian individuals from the IndiGen project was analysed. Further, these variants were evaluated for potential pathogenicity and their associations with genetic diseases. We also compared our identified variations with the existing global datasets. We generated a compendium of 38,560 high-confidence structural variants, comprising 28,393 deletions, 5030 duplications, 5038 insertions, and 99 inversions. Notably, around 55% of these variants were unique to the studied population. Further analysis revealed 134 deletions with predicted pathogenic or likely pathogenic effects, and the affected genes were mainly enriched for neurological disease conditions, such as intellectual disability and neurodegenerative diseases. The IndiGenomes dataset helped us to understand the unique spectrum of structural variants in the Indian population. More than half of the identified variants were not present in the publicly available global dataset on structural variants. Clinically important deletions identified in IndiGenomes might aid in improving the diagnosis of unsolved genetic diseases, particularly in neurological conditions. Along with basal allele frequency data and clinically important deletions, IndiGenomes data might serve as a baseline resource for future studies on genomic structural variant analysis in the Indian population.
Affiliation(s)
- Mohit Kumar Divakar
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Abhinav Jain
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Rahul C Bhoyar
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India
- Vigneshwar Senthivel
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Bani Jolly
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Mohamed Imran
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Disha Sharma
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Anjali Bajaj
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Vishu Gupta
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Vinod Scaria
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Sridhar Sivasubbu
- CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, New Delhi, 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
|
10
|
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023] Open
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low-cost yet highly error-prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Affiliation(s)
- Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
|