1
Durward-Akhurst SA, Schaefer RJ, Grantham B, Carey WK, Mickelson JR, McCue ME. Genetic Variation and the Distribution of Variant Types in the Horse. Front Genet 2021; 12:758366. [PMID: 34925451; PMCID: PMC8676274; DOI: 10.3389/fgene.2021.758366] [Received: 08/16/2021] [Accepted: 11/10/2021] [Indexed: 11/13/2022]
Abstract
Genetic variation is a key contributor to health and disease. Understanding the link between an individual's genotype and the corresponding phenotype is a major goal of medical genetics. Whole genome sequencing (WGS) within and across populations enables highly efficient variant discovery and elucidation of the molecular nature of virtually all genetic variation. Here, we report the largest catalog of genetic variation for the horse, a species of importance as a model for human athletic and performance related traits, using WGS of 534 horses. We show the extent of agreement between two commonly used variant callers. In data from ten target breeds that represent major breed clusters in the domestic horse, we demonstrate the distribution of variants, their allele frequencies across breeds, and identify variants that are unique to a single breed. We investigate variants with no homozygotes that may be potential embryonic lethal variants, as well as variants present in all individuals that likely represent regions of the genome with errors, poor annotation or where the reference genome carries a variant. Finally, we show regions of the genome that have higher or lower levels of genetic variation compared to the genome average. This catalog can be used for variant prioritization for important equine diseases and traits, and to provide key information about regions of the genome where the assembly and/or annotation need to be improved.
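The two variant classes singled out in this abstract (variants with no homozygotes, and variants present in every individual) can be illustrated with a minimal sketch. It assumes genotypes are stored as alternate-allele counts (0, 1 or 2) per horse; the function name `flag_variants` and the toy data are hypothetical, and the authors' actual filtering criteria are not given in the abstract.

```python
from typing import Dict, List


def flag_variants(genotypes: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Flag variants by genotype pattern.

    `genotypes` maps a variant ID to per-individual alternate-allele
    counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous
    alternate). This encoding is an assumption made for illustration.
    """
    no_homozygotes: List[str] = []
    in_all_individuals: List[str] = []
    for variant_id, counts in genotypes.items():
        carriers = [c for c in counts if c > 0]
        # Observed in at least one individual, but never as homozygous alternate.
        if carriers and 2 not in counts:
            no_homozygotes.append(variant_id)
        # Every individual carries at least one alternate allele.
        if counts and len(carriers) == len(counts):
            in_all_individuals.append(variant_id)
    return {
        "no_homozygotes": no_homozygotes,
        "in_all_individuals": in_all_individuals,
    }


# Hypothetical toy data: three variants genotyped in four horses.
toy_genotypes = {
    "var_a": [0, 1, 1, 0],  # heterozygotes only
    "var_b": [2, 1, 2, 2],  # carried by every horse
    "var_c": [0, 2, 1, 0],  # ordinary segregating variant
}
flags = flag_variants(toy_genotypes)
print(flags)
```

In a real catalog the absence of homozygotes would also need to be weighed against allele frequency and sample size, since a rare variant is expected to show few or no homozygotes by chance.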
Affiliation(s)
- S. A. Durward-Akhurst
  - Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
- R. J. Schaefer
  - Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
- B. Grantham
  - Interval Bio LLC, Mountain View, CA, United States
- W. K. Carey
  - Interval Bio LLC, Mountain View, CA, United States
- J. R. Mickelson
  - Department of Veterinary and Biomedical Sciences, University of Minnesota, Minneapolis, MN, United States
- M. E. McCue
  - Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
2
Mohammadi-Dehcheshmeh M, Moghbeli SM, Rahimirad S, Alanazi IO, Shehri ZSA, Ebrahimie E. A Transcription Regulatory Sequence in the 5' Untranslated Region of SARS-CoV-2 Is Vital for Virus Replication with an Altered Evolutionary Pattern against Human Inhibitory MicroRNAs. Cells 2021; 10:319. [PMID: 33557205; PMCID: PMC7913991; DOI: 10.3390/cells10020319] [Received: 11/24/2020] [Revised: 01/23/2021] [Accepted: 01/27/2021] [Indexed: 12/12/2022]
Abstract
Our knowledge of the evolution and role of the untranslated region (UTR) in SARS-CoV-2 pathogenicity is very limited. The leader sequence, which originates from the UTR, is found at the 5' ends of all encoded SARS-CoV-2 transcripts, highlighting its importance. Here, the evolution of the leader sequence was compared between human pathogenic and non-pathogenic coronaviruses. Then, microRNAs that can inactivate the key UTR regions of coronaviruses were profiled. A distinct pattern of evolution was found in the leader sequence of SARS-CoV-2. Mining all available microRNA families against the leader sequences of coronaviruses led to the discovery of 39 microRNAs with a stable thermodynamic binding energy. Notably, SARS-CoV-2 had a lower binding stability against microRNAs. hsa-MIR-5004-3p was the only human microRNA able to target the leader sequence of SARS and, to a lesser extent, SARS-CoV-2. However, its binding stability decreased remarkably in SARS-CoV-2. We found some plant microRNAs with low and stable binding energy against SARS-CoV-2. Meta-analysis documented a significant (p < 0.01) decline in the expression of MIR-5004-3p after SARS-CoV-2 infection in trachea, lung biopsy, and bronchial organoids as well as lung-derived Calu-3 and A549 cells. The paucity of innate human inhibitory microRNAs able to bind to the leader sequence of SARS-CoV-2 can contribute to its high replication in infected human cells.
Affiliation(s)
- Manijeh Mohammadi-Dehcheshmeh
  - La Trobe Genomics Research Platform, School of Life Sciences, College of Science, Health and Engineering, La Trobe University, Melbourne, VIC 3086, Australia
  - School of Animal and Veterinary Sciences, The University of Adelaide, Adelaide, SA 5371, Australia
- Sadrollah Molaei Moghbeli
  - Department of Animal Science, College of Agricultural and Life Sciences, University of Wisconsin, Madison, WI 1675, USA
- Samira Rahimirad
  - Department of Medical Genetics, National Institute of Genetic Engineering and Biotechnology, Tehran 1497716316, Iran
- Ibrahim O. Alanazi
  - National Center for Biotechnology, Life Science and Environment Research Institute, King Abdulaziz City for Science and Technology (KACST), Riyadh 6086, Saudi Arabia
- Zafer Saad Al Shehri
  - Department of Medical Laboratories, College of Applied Medical Sciences, Shaqra University, KSA, Al dawadmi 1678, Saudi Arabia
- Esmaeil Ebrahimie
  - La Trobe Genomics Research Platform, School of Life Sciences, College of Science, Health and Engineering, La Trobe University, Melbourne, VIC 3086, Australia
  - School of Animal and Veterinary Sciences, The University of Adelaide, Adelaide, SA 5371, Australia
  - School of BioSciences, The University of Melbourne, Melbourne, VIC 3052, Australia
  - Correspondence: ; Tel.: +61-4491-213-57
3
Abstract
Modern DNA sequencing has instituted a new era in human cytomegalovirus (HCMV) genomics. A key development has been the ability to determine the genome sequences of HCMV strains directly from clinical material. This involves the application of complex and often non-standardized bioinformatics approaches to analysing data of variable quality in a process that requires substantial manual intervention. To relieve this bottleneck, we have developed GRACy (Genome Reconstruction and Annotation of Cytomegalovirus), an easy-to-use toolkit for analysing HCMV sequence data. GRACy automates and integrates modules for read filtering, genotyping, genome assembly, genome annotation, variant analysis, and data submission. These modules were tested extensively on simulated and experimental data and outperformed generic approaches. GRACy is written in Python and is embedded in a graphical user interface with all required dependencies installed by a single command. It runs on the Linux operating system and is designed to allow the future implementation of a cross-platform version. GRACy is distributed under a GPL 3.0 license and is freely available at https://bioinformatics.cvr.ac.uk/software/ with the manual and a test dataset.
Affiliation(s)
- Nicolás M Suárez
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Antonia Chalka
  - Division of Infection & Immunity, Roslin Institute, R(D)SVM, University of Edinburgh, Edinburgh, UK
- Cristina Venturini
  - Division of Infection and Immunity, University College London, London, UK
- Judith Breuer
  - Division of Infection and Immunity, University College London, London, UK
- Andrew J Davison
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
4
Dimonaco NJ, Salavati M, Shih BB. Computational Analysis of SARS-CoV-2 and SARS-Like Coronavirus Diversity in Human, Bat and Pangolin Populations. Viruses 2020; 13:E49. [PMID: 33396801; PMCID: PMC7823979; DOI: 10.3390/v13010049] [Received: 12/03/2020] [Revised: 12/21/2020] [Accepted: 12/22/2020] [Indexed: 12/14/2022]
Abstract
In 2019, a novel coronavirus, SARS-CoV-2/nCoV-19, emerged in Wuhan, China, and has been responsible for the current COVID-19 pandemic. The evolutionary origins of the virus remain elusive and understanding its complex mutational signatures could guide vaccine design and development. As part of the international "CoronaHack" in April 2020, we employed a collection of contemporary methodologies to compare the genomic sequences of coronaviruses isolated from human (SARS-CoV-2; n = 163), bat (bat-CoV; n = 215) and pangolin (pangolin-CoV; n = 7) available in public repositories. We have also noted the pangolin-CoV isolate MP789 to bear a stronger resemblance to SARS-CoV-2 than other pangolin-CoV. Following de novo gene annotation prediction, analyses of gene-gene similarity network, codon usage bias and variant discovery were undertaken. Strong host-associated divergences were noted in ORF3a, ORF6, ORF7a, ORF8 and S, and in codon usage bias profiles. Last, we have characterised several high impact variants (in-frame insertion/deletion or stop gain) in bat-CoV and pangolin-CoV populations, some of which are found in the same amino acid position and may be highlighting loci of potential functional relevance.
Affiliation(s)
- Nicholas J. Dimonaco
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Wales SY3 3FL, UK
- Mazdak Salavati
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian EH25 9RG, UK
- Barbara B. Shih
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian EH25 9RG, UK
5
Borges MG, de Moraes HT, Rocha CDS, Lopes-Cendes I. The impact of post-alignment processing procedures on whole-exome sequencing data. Genet Mol Biol 2020; 43:e20200047. [PMID: 33306778; PMCID: PMC7783507; DOI: 10.1590/1678-4685-gmb-2020-0047] [Received: 02/26/2020] [Accepted: 09/18/2020] [Indexed: 12/01/2022]
Abstract
The use of post-alignment procedures has been suggested to prevent the identification of false-positives in massive DNA sequencing data. Insertions and deletions are most likely to be misinterpreted by variant calling algorithms. Using known genetic variants as references for post-processing pipelines can minimize mismatches. They allow reads to be correctly realigned and recalibrated, resulting in more parsimonious variant calling. In this work, we aim to investigate the impact of using different sets of common variants as references to facilitate variant calling from whole-exome sequencing data. We selected reference variants from common insertions and deletions available within the 1K Genomes project data and from databases from the Latin American Database of Genetic Variation (LatinGen). We used the Genome Analysis Toolkit to perform post-processing procedures like local realignment, quality recalibration procedures, and variant calling in whole exome samples. We identified an increased number of variants from the call set for all groups when no post-processing procedure was performed. We found that there was a higher concordance rate between variants called using 1K Genomes and LatinGen. Therefore, we believe that the increased number of rare variants identified in the analysis without realignment or quality recalibration indicated that they were likely false-positives.
Affiliation(s)
- Murilo Guimarães Borges
  - Universidade Estadual de Campinas (UNICAMP), Faculdade de Ciências Médicas, Departamento de Genética Médica e Medicina Genômica, Campinas, SP, Brazil
  - Instituto Brasileiro de Neurociência e Neurotecnologia (BRAINN), Campinas, SP, Brazil
  - Universidade Estadual de Campinas (UNICAMP), Centro de Engenharia Biomédica, Campinas, SP, Brazil
- Helena Tadiello de Moraes
  - Universidade Estadual de Campinas (UNICAMP), Faculdade de Ciências Médicas, Departamento de Genética Médica e Medicina Genômica, Campinas, SP, Brazil
  - Instituto Brasileiro de Neurociência e Neurotecnologia (BRAINN), Campinas, SP, Brazil
- Cristiane de Souza Rocha
  - Universidade Estadual de Campinas (UNICAMP), Faculdade de Ciências Médicas, Departamento de Genética Médica e Medicina Genômica, Campinas, SP, Brazil
  - Instituto Brasileiro de Neurociência e Neurotecnologia (BRAINN), Campinas, SP, Brazil
- Iscia Lopes-Cendes
  - Universidade Estadual de Campinas (UNICAMP), Faculdade de Ciências Médicas, Departamento de Genética Médica e Medicina Genômica, Campinas, SP, Brazil
  - Instituto Brasileiro de Neurociência e Neurotecnologia (BRAINN), Campinas, SP, Brazil
6
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020; 20:1140-1150. [PMID: 28968737; DOI: 10.1093/bib/bbx098] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023]
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
7
Lowy-Gallego E, Fairley S, Zheng-Bradley X, Ruffier M, Clarke L, Flicek P. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res 2019; 4:50. [PMID: 32175479; PMCID: PMC7059836; DOI: 10.12688/wellcomeopenres.15126.2] [Accepted: 12/17/2019] [Indexed: 12/20/2022]
Abstract
We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.
Affiliation(s)
- Ernesto Lowy-Gallego
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Susan Fairley
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Xiangqun Zheng-Bradley
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Magali Ruffier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Laura Clarke
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
8
Caruana BM, Pembleton LW, Constable F, Rodoni B, Slater AT, Cogan NOI. Validation of Genotyping by Sequencing Using Transcriptomics for Diversity and Application of Genomic Selection in Tetraploid Potato. Front Plant Sci 2019; 10:670. [PMID: 31191581; PMCID: PMC6548859; DOI: 10.3389/fpls.2019.00670] [Received: 08/24/2018] [Accepted: 05/03/2019] [Indexed: 05/10/2023]
Abstract
Potato is an important food crop due to its increasing consumption, and as a result, there is demand for varieties with improved production. However, the current status of breeding for improved varieties is a long process which relies heavily on phenotypic evaluation and dated molecular techniques and has little emphasis on modern genotyping approaches. Evaluation and selection before a cultivar is commercialized typically takes 10-15 years. Molecular markers have been developed for disease and pest resistance, resulting in initial marker-assisted selection in breeding. This study has evaluated and implemented a high-throughput transcriptome sequencing method for dense marker discovery in potato for the application of genomic selection. An Australian relevant collection of commercial cultivars was selected, and identification and distribution of high quality SNPs were examined using standard bioinformatic pipelines and a custom approach for the prediction of allelic dosage. As a result, a large number of SNP markers were identified and filtered to generate a high-quality subset that was then combined with historic phenotypic data to assess the approach for genomic selection. Genomic selection potential was predicted for highly heritable traits and the approach demonstrated advantages over the previously used technologies in terms of markers identified as well as costs incurred. The high-quality SNP list also provided acceptable genome coverage which demonstrates its applicability for much larger future studies. This SNP list was also annotated to provide an indication of function and will serve as a resource for the community in future studies. Genome wide marker tools will provide significant benefits for potato breeding efforts and the application of genomic selection will greatly enhance genetic progress.
Affiliation(s)
- B. M. Caruana
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
- L. W. Pembleton
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
- F. Constable
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
- B. Rodoni
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
- A. T. Slater
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
- N. O. I. Cogan
  - Agriculture Victoria Research, Agriculture Victoria, AgriBio, The Centre for AgriBioscience, Bundoora, VIC, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, Australia
  - *Correspondence: N. O. I. Cogan,
9
Abstract
A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].
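The idea that paths through a superbubble spell out the possible sequences at a site can be sketched on a toy graph. The example below is a simplification on an ordinary directed acyclic graph, not the bidirected or biedged graphs the abstract generalizes to, and it does not detect superbubbles or snarls; it only enumerates source-to-sink paths in a bubble that is given. The function name, node names and sequence labels are all hypothetical.

```python
from typing import Dict, List


def bubble_sequences(
    edges: Dict[str, List[str]],
    labels: Dict[str, str],
    source: str,
    sink: str,
) -> List[str]:
    """Spell the sequence of every source-to-sink path in a small DAG.

    `edges` maps a node to its successors and `labels` maps a node to
    the DNA segment it carries. The graph is assumed to be acyclic.
    """
    sequences: List[str] = []

    def walk(node: str, prefix: str) -> None:
        prefix = prefix + labels[node]
        if node == sink:
            sequences.append(prefix)
            return
        for successor in edges.get(node, []):
            walk(successor, prefix)

    walk(source, "")
    return sequences


# Hypothetical bubble: a single-base G/T difference between shared flanks.
toy_edges = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
toy_labels = {"s": "AC", "a": "G", "b": "T", "t": "CA"}
alleles = bubble_sequences(toy_edges, toy_labels, "s", "t")
print(alleles)  # ['ACGCA', 'ACTCA']
```

Exhaustive enumeration like this grows exponentially with the number of nested branch points, which is one reason a nested decomposition of the graph is useful.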
Affiliation(s)
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California
- Jordan M Eizenga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California
- Yohei M Rosen
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California
- Adam M Novak
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California
- Erik Garrison
  - Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Glenn Hickey
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California
10
Wang S, Xing J. A primer for disease gene prioritization using next-generation sequencing data. Genomics Inform 2013; 11:191-9. [PMID: 24465230; PMCID: PMC3897846; DOI: 10.5808/gi.2013.11.4.191] [Received: 10/18/2013] [Revised: 11/18/2013] [Accepted: 11/21/2013] [Indexed: 01/21/2023]
Abstract
High-throughput next-generation sequencing (NGS) technology produces a tremendous amount of raw sequence data. The challenges for researchers are to process the raw data, to map the sequences to genome, to discover variants that are different from the reference genome, and to prioritize/rank the variants for the question of interest. The recent development of many computational algorithms and programs has vastly improved the ability to translate sequence data into valuable information for disease gene identification. However, the NGS data analysis is complex and could be overwhelming for researchers who are not familiar with the process. Here, we outline the analysis pipeline and describe some of the most commonly used principles and tools for analyzing NGS data for disease gene identification.
Affiliation(s)
- Shuoguo Wang
  - Department of Genetics, The State University of New Jersey, Piscataway, NJ 08854, USA
  - Human Genetics Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Jinchuan Xing
  - Department of Genetics, The State University of New Jersey, Piscataway, NJ 08854, USA
  - Human Genetics Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA