1
Kojima S. Investigating mobile element variations by statistical genetics. Hum Genome Var 2024; 11:23. [PMID: 38816353 PMCID: PMC11140006 DOI: 10.1038/s41439-024-00280-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/17/2024] [Accepted: 04/24/2024] [Indexed: 06/01/2024]
Abstract
The integration of structural variations (SVs) in statistical genetics provides an opportunity to understand the genetic factors influencing complex human traits and disease. Recent advances in long-read technology and variant calling methods for short reads have improved the accurate discovery and genotyping of SVs, enabling their use in expression quantitative trait loci (eQTL) analysis and genome-wide association studies (GWAS). Mobile elements are DNA sequences that insert themselves into various genome locations. Insertional polymorphisms of mobile elements between humans, called mobile element variations (MEVs), contribute to approximately 25% of human SVs. We recently developed a variant caller that can accurately identify and genotype MEVs from biobank-scale short-read whole-genome sequencing (WGS) datasets and integrate them into statistical genetics. The use of MEVs in eQTL analysis and GWAS has a minimal impact on the discovery of genome loci associated with gene expression and disease; most disease-associated haplotypes can be identified by single nucleotide variations (SNVs). On the other hand, it helps make hypotheses about causal variants or effector variants. Focusing on MEVs, we identified multiple MEVs that contribute to differential gene expression and one of them is a potential cause of skin disease, emphasizing the importance of the integration of MEVs in medical genetics. Here, I will provide an overview of MEVs, MEV calling from WGS, and the integration of MEVs in statistical genetics. Finally, I will discuss the unanswered questions about MEVs, such as rare variants.
Affiliation(s)
- Shohei Kojima
  - Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
2
Zhao P, Gu L, Gao Y, Pan Z, Liu L, Li X, Zhou H, Yu D, Han X, Qian L, Liu GE, Fang L, Wang Z. Young SINEs in pig genomes impact gene regulation, genetic diversity, and complex traits. Commun Biol 2023; 6:894. [PMID: 37652983 PMCID: PMC10471783 DOI: 10.1038/s42003-023-05234-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Accepted: 08/09/2023] [Indexed: 09/02/2023]
Abstract
Transposable elements (TEs) are a major source of genetic polymorphisms and play a role in chromatin architecture, gene regulatory networks, and genomic evolution. However, their functional role in pigs and contributions to complex traits are largely unknown. We created a catalog of TEs (n = 3,087,929) in pigs and found that young SINEs were predominantly silenced by histone modifications, DNA methylation, and decreased accessibility. However, some transcripts from active young SINEs showed high tissue-specificity, as confirmed by analyzing 3570 RNA-seq samples. We also detected 211,067 dimorphic SINEs in 374 individuals, including 340 population-specific ones associated with local adaptation. Mapping these dimorphic SINEs to genome-wide associations of 97 complex traits in pigs, we found 54 candidate genes (e.g., ANK2 and VRTN) that might be mediated by TEs. Our findings highlight the important roles of young SINEs and provide a supplement for genotype-to-phenotype associations and modern breeding in pigs.
Affiliation(s)
- Pengju Zhao
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Lihong Gu
  - Institute of Animal Science & Veterinary Medicine, Hainan Academy of Agricultural Sciences, No. 14 Xingdan Road, Haikou, 571100, China
- Yahui Gao
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, 20705, USA
- Zhangyuan Pan
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Lei Liu
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518124, China
- Xingzheng Li
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518124, China
- Huaijun Zhou
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Dongyou Yu
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Xinyan Han
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Lichun Qian
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- George E Liu
  - Animal Genomics and Improvement Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, 20705, USA
- Lingzhao Fang
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, 8000, Denmark
- Zhengguang Wang
  - Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, 572000, China
  - College of Animal Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, China
3
Chen X, Bourque G, Goubert C. Genotyping of Transposable Element Insertions Segregating in Human Populations Using Short-Read Realignments. Methods Mol Biol 2023; 2607:63-83. [PMID: 36449158 DOI: 10.1007/978-1-0716-2883-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Transposable element (TE) insertions are a major source of structural variation in the human genome. Due to the repetitive nature and biological importance of TEs, many bioinformatic tools have been developed to identify and genotype TE insertion polymorphisms using high-throughput short reads. In this chapter, we outline recently developed methods to characterize TE insertion polymorphisms in human populations. We also provide detailed protocols to tackle this question primarily using three software tools: MELT2, ERVcaller, and TypeREF.
Affiliation(s)
- Xun Chen
  - Institute for the Advanced Study of Human Biology (ASHBi), Kyoto University, Kyoto, Japan
- Guillaume Bourque
  - Institute for the Advanced Study of Human Biology (ASHBi), Kyoto University, Kyoto, Japan
  - Canadian Centre for Computational Genomics, McGill University, Montreal, QC, Canada
  - McGill Genome Centre, Montreal, QC, Canada
  - Human Genetics, McGill University, Montreal, QC, Canada
- Clément Goubert
  - Canadian Centre for Computational Genomics, McGill University, Montreal, QC, Canada
  - McGill Genome Centre, Montreal, QC, Canada
  - Human Genetics, McGill University, Montreal, QC, Canada
4
Kabiljo R, Bowles H, Marriott H, Jones AR, Bouton CR, Dobson RJ, Quinn JP, Al Khleifat A, Swanson CM, Al-Chalabi A, Iacoangeli A. RetroSnake: A modular pipeline to detect human endogenous retroviruses in genome sequencing data. iScience 2022; 25:105289. [PMID: 36339261 PMCID: PMC9626663 DOI: 10.1016/j.isci.2022.105289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/03/2022] [Revised: 08/08/2022] [Accepted: 10/04/2022] [Indexed: 12/02/2022]
Abstract
Human endogenous retroviruses (HERVs) integrated into the human genome as a result of ancient exogenous infections and currently comprise ∼8% of our genome. The members of the most recently acquired HERV family, HERV-Ks, still retain the potential to produce viral molecules and have been linked to a wide range of diseases including cancer and neurodegeneration. Although a range of tools for HERV detection in NGS data exists, most of them lack wet-lab validation and do not cover all steps of the analysis. Here, we describe RetroSnake, an end-to-end, modular, computationally efficient, and customizable pipeline for the discovery of HERVs in short-read NGS data. RetroSnake is based on an extensively wet-lab-validated protocol, covers all steps of the analysis from raw data to the generation of annotated results presented as an interactive HTML file, and is easy for life scientists to use without substantial computational training. Availability and implementation: The pipeline and extensive documentation are available on GitHub.
Affiliation(s)
- Renata Kabiljo
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London SE5 8AF, UK
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Harry Bowles
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London SE5 8AF, UK
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Heather Marriott
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London SE5 8AF, UK
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Ashley R. Jones
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Clement R. Bouton
  - Department of Infectious Diseases, School of Immunology and Microbial Sciences, King’s College London, London, UK
- Richard J.B. Dobson
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London SE5 8AF, UK
  - NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, UK
  - Institute of Health Informatics, University College London, London, UK
  - NIHR Biomedical Research Centre at University College London Hospitals NHS Foundation Trust, London, UK
- John P. Quinn
  - Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 3BX, UK
- Ahmad Al Khleifat
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Chad M. Swanson
  - Department of Infectious Diseases, School of Immunology and Microbial Sciences, King’s College London, London, UK
- Ammar Al-Chalabi
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
- Alfredo Iacoangeli
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London SE5 8AF, UK
  - Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London SE5 9NU, UK
  - NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, UK
5
Marchi E, Jones M, Klenerman P, Frater J, Magiorkinis G, Belshaw R. BreakAlign: a Perl program to align chimaeric (split) genomic NGS reads and allow visual confirmation of novel retroviral integrations. BMC Bioinformatics 2022; 23:134. [PMID: 35428171 PMCID: PMC9013057 DOI: 10.1186/s12859-022-04621-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/04/2021] [Accepted: 02/28/2022] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Retroviruses replicate by integrating a DNA copy into a host chromosome. Detecting novel retroviral integrations (ones not in the reference genome sequence of the host) from genomic NGS data is bioinformatically challenging and frequently produces many false positives. One common method of confirmation is visual inspection of an alignment of the chimaeric (split) reads that span a putative novel retroviral integration site. We perceived the need for a program that would facilitate this by producing a multiple alignment containing both the viral and host regions that flank an integration. RESULTS: BreakAlign is a Perl program that uses blastn to produce such a multiple alignment. In addition to the NGS dataset and a reference viral sequence, the program requires either (a) the ~500 nt host genome sequence that spans the putative integration or (b) coordinates of this putative integration in an installed copy of the reference human genome (multiple integrations can be processed automatically). BreakAlign is freely available from https://github.com/marchiem/breakalign and is accompanied by example files allowing a test run. CONCLUSION: BreakAlign will confirm and facilitate characterisation of both (a) germline integrations of endogenous retroviruses and (b) somatic integrations of exogenous retroviruses such as HIV and HTLV. Although developed for use with genomic short-read NGS (second generation) data and retroviruses, it should also be useful for long-read (third generation) data and any mobile element with at least one conserved flanking region.
Affiliation(s)
- Emanuele Marchi
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Mathew Jones
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Paul Klenerman
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
- John Frater
  - Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Gkikas Magiorkinis
  - Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece
- Robert Belshaw
  - Department of Biology, College of Science and Technology, Wenzhou-Kean University, Wenzhou, Zhejiang Province, China
6
Bowles H, Kabiljo R, Al Khleifat A, Jones A, Quinn JP, Dobson RJB, Swanson CM, Al-Chalabi A, Iacoangeli A. An assessment of bioinformatics tools for the detection of human endogenous retroviral insertions in short-read genome sequencing data. Front Bioinform 2022; 2:1062328. [PMID: 36845320 PMCID: PMC9945273 DOI: 10.3389/fbinf.2022.1062328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Accepted: 12/12/2022] [Indexed: 02/10/2023]
Abstract
There is a growing interest in the study of human endogenous retroviruses (HERVs) given the substantial body of evidence that implicates them in many human diseases. Although their genomic characterization presents numerous technical challenges, next-generation sequencing (NGS) has shown potential to detect HERV insertions and their polymorphisms in humans. Currently, a number of computational tools to detect them in short-read NGS data exist. In order to design optimal analysis pipelines, an independent evaluation of the available tools is required. We evaluated the performance of a set of such tools using a variety of experimental designs and datasets. These included 50 human short-read whole-genome sequencing samples, matching long and short-read sequencing data, and simulated short-read NGS data. Our results highlight a great performance variability of the tools across the datasets and suggest that different tools might be suitable for different study designs. However, specialized tools designed to detect exclusively human endogenous retroviruses consistently outperformed generalist tools that detect a wider range of transposable elements. We suggest that, if sufficient computing resources are available, using multiple HERV detection tools to obtain a consensus set of insertion loci may be ideal. Furthermore, given that the false positive discovery rate of the tools varied between 8% and 55% across tools and datasets, we recommend the wet lab validation of predicted insertions if DNA samples are available.
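The consensus strategy suggested in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the tool names, the 100 bp merge window, and the two-tool support threshold are hypothetical choices made only for illustration. Calls from different tools are sorted by position, nearby calls are merged into one locus, and a locus is kept only if enough distinct tools support it.

```python
def consensus_loci(calls_by_tool, window=100, min_tools=2):
    """Merge insertion calls from several tools into consensus loci.

    calls_by_tool: dict mapping a tool name to a list of (chrom, pos) calls.
    window and min_tools are illustrative defaults, not published values.
    """
    # Flatten to (chrom, pos, tool) and sort by chromosome, then position.
    records = sorted(
        (chrom, pos, tool)
        for tool, calls in calls_by_tool.items()
        for chrom, pos in calls
    )
    clusters = []
    for chrom, pos, tool in records:
        last = clusters[-1] if clusters else None
        # Extend the current cluster if the call is close to its last position.
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            last["end"] = pos
            last["tools"].add(tool)
        else:
            clusters.append(
                {"chrom": chrom, "start": pos, "end": pos, "tools": {tool}}
            )
    # Keep only loci supported by at least min_tools distinct tools.
    return [
        (c["chrom"], c["start"], c["end"], sorted(c["tools"]))
        for c in clusters
        if len(c["tools"]) >= min_tools
    ]


# Hypothetical calls from three unnamed tools.
calls = {
    "tool_a": [("chr1", 1000), ("chr2", 500)],
    "tool_b": [("chr1", 1040)],
    "tool_c": [("chr2", 9000)],
}
print(consensus_loci(calls))
```

Because each call is compared with the last position in the current cluster, a long run of closely spaced calls can chain into one wide locus; a stricter design would cap the cluster width.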
Affiliation(s)
- Harry Bowles
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
- Renata Kabiljo
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
  - Department of Biostatistics and Health Informatics, King’s College London, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
- Ahmad Al Khleifat
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
- Ashley Jones
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
- John P. Quinn
  - Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- Richard J. B. Dobson
  - Department of Biostatistics and Health Informatics, King’s College London, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
  - NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
  - Institute of Health Informatics, University College London, London, United Kingdom
  - NIHR Biomedical Research Centre, University College London Hospitals NHS Foundation Trust, London, United Kingdom
- Chad M. Swanson
  - Department of Infectious Diseases, School of Immunology and Microbial Sciences, King’s College London, London, United Kingdom
- Ammar Al-Chalabi
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
  - Department of Neurology, King’s College Hospital, London, United Kingdom
- Alfredo Iacoangeli
  - Department of Basic and Clinical Neuroscience, King’s College London, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
  - Department of Biostatistics and Health Informatics, King’s College London, Institute of Psychiatry, Psychology and Neuroscience, London, United Kingdom
  - NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
  - *Correspondence: Alfredo Iacoangeli,
7
Bogaerts-Márquez M, Barrón MG, Fiston-Lavier AS, Vendrell-Mir P, Castanera R, Casacuberta JM, González J. T-lex3: an accurate tool to genotype and estimate population frequencies of transposable elements using the latest short-read whole genome sequencing data. Bioinformatics 2020; 36:1191-1197. [PMID: 31580402 PMCID: PMC7703783 DOI: 10.1093/bioinformatics/btz727] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/28/2019] [Revised: 09/16/2019] [Accepted: 09/25/2019] [Indexed: 12/22/2022]
Abstract
Motivation: Transposable elements (TEs) constitute a significant proportion of the majority of genomes sequenced to date. TEs are responsible for a considerable fraction of the genetic variation within and among species. Accurate genotyping of TEs in genomes is therefore crucial for a complete identification of the genetic differences among individuals, populations and species. Results: In this work, we present a new version of T-lex, a computational pipeline that accurately genotypes and estimates the population frequencies of reference TE insertions using short-read high-throughput sequencing data. In this new version, we have re-designed the T-lex algorithm to integrate the BWA-MEM short-read aligner, which is one of the most accurate short-read mappers and can be launched on longer short reads (e.g. reads >150 bp). We have added new filtering steps to increase the accuracy of the genotyping, and new parameters that allow the user to control both the minimum and maximum number of reads, and the minimum number of strains to genotype a TE insertion. We also showed for the first time that T-lex3 provides accurate TE calls in a plant genome. Availability and implementation: To test the accuracy of T-lex3, we called 1630 individual TE insertions in Drosophila melanogaster, 1600 individual TE insertions in humans, and 3067 individual TE insertions in the rice genome. We showed that this new version of T-lex is a broadly applicable and accurate tool for genotyping and estimating TE frequencies in organisms with different genome sizes and different TE contents. T-lex3 is available on GitHub: https://github.com/GonzalezLab/T-lex3. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- María Bogaerts-Márquez
  - Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), Paseo Maritimo Barceloneta 37-49, Barcelona, Spain
- Maite G Barrón
  - Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), Paseo Maritimo Barceloneta 37-49, Barcelona, Spain
- Anna-Sophie Fiston-Lavier
  - Institut des Sciences de l'Evolution de Montpellier (UMR 5554, CNRS-UM-IRD-EPHE), 11 Université de Montpellier, Place Eugène Bataillon, Montpellier, France
- Pol Vendrell-Mir
  - Center for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Campus UAB, Cerdanyola del Vallès, Barcelona, Spain
- Raúl Castanera
  - Center for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Campus UAB, Cerdanyola del Vallès, Barcelona, Spain
- Josep M Casacuberta
  - Center for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Campus UAB, Cerdanyola del Vallès, Barcelona, Spain
- Josefa González
  - Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), Paseo Maritimo Barceloneta 37-49, Barcelona, Spain
8
Chen X, Li D. ERVcaller: identifying polymorphic endogenous retrovirus and other transposable element insertions using whole-genome sequencing data. Bioinformatics 2019; 35:3913-3922. [PMID: 30895294 DOI: 10.1093/bioinformatics/btz205] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 07/11/2018] [Revised: 02/28/2019] [Accepted: 03/19/2019] [Indexed: 12/12/2022]
Abstract
MOTIVATION: Approximately 8% of the human genome is derived from endogenous retroviruses (ERVs). In recent years, an increasing number of human diseases have been found to be associated with ERVs. However, it remains challenging to accurately detect the full spectrum of polymorphic (unfixed) ERVs using whole-genome sequencing (WGS) data. RESULTS: We designed a new tool, ERVcaller, to detect and genotype transposable element (TE) insertions, including ERVs, in the human genome. We evaluated ERVcaller using both simulated and real benchmark WGS datasets. Compared to existing tools, ERVcaller consistently obtained both the highest sensitivity and precision for detecting simulated ERV and other TE insertions derived from real polymorphic TE sequences. For the WGS data from the 1000 Genomes Project, ERVcaller detected the largest number of TE insertions per sample based on consensus TE loci. By analyzing the experimentally verified TE insertions, ERVcaller had 94.0% TE detection sensitivity and 96.6% genotyping accuracy. Polymerase chain reaction and Sanger sequencing in a small sample set verified 86.7% of examined insertion statuses and 100% of examined genotypes. In conclusion, ERVcaller is capable of detecting and genotyping TE insertions using WGS data with both high sensitivity and precision. This tool can be applied broadly to other species. AVAILABILITY AND IMPLEMENTATION: http://www.uvm.edu/genomics/software/ERVcaller.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xun Chen
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT, USA
- Dawei Li
  - Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT, USA
  - Neuroscience, Behavior, and Health Initiative, University of Vermont, Burlington, VT, USA
  - Department of Computer Science, University of Vermont, Burlington, VT, USA
9
Goubert C, Thomas J, Payer LM, Kidd JM, Feusier J, Watkins WS, Burns KH, Jorde LB, Feschotte C. TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data. Nucleic Acids Res 2020; 48:e36. [PMID: 32067044 PMCID: PMC7102983 DOI: 10.1093/nar/gkaa074] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/20/2019] [Revised: 01/08/2020] [Accepted: 02/11/2020] [Indexed: 12/12/2022]
Abstract
Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alus are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alus and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline - TypeTE - which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a high-quality set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.
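To make the genotype-likelihood step in this abstract concrete, here is a minimal Python sketch under a simple binomial read-count model. This is an assumed textbook model, not TypeTE's actual likelihood: each read remapped to the reconstructed alleles is taken to support either the insertion (presence) allele or the empty (absence) allele, and the per-read error rate of 0.01 is a hypothetical default.

```python
import math


def genotype_likelihoods(n_ins, n_ref, error=0.01):
    """Normalized likelihoods for genotypes 0/0, 0/1 and 1/1.

    n_ins: reads supporting the insertion (presence) allele.
    n_ref: reads supporting the empty (absence) allele.
    error: assumed probability that a read supports the wrong allele.
    """
    # Probability that a single read supports the insertion, per genotype.
    p_ins = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    # Binomial log-likelihood; the binomial coefficient is the same for
    # every genotype, so it is omitted.
    log_l = {
        g: n_ins * math.log(p) + n_ref * math.log(1.0 - p)
        for g, p in p_ins.items()
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    best = max(log_l.values())
    weights = {g: math.exp(v - best) for g, v in log_l.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(n_ins, n_ref, error=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(n_ins, n_ref, error)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(10, 10))  # balanced support for both alleles
```

With no reads at all, the three genotypes are equally likely and the first one is returned, so a real pipeline would report a missing genotype below some minimum depth.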
Affiliation(s)
- Clément Goubert
  - Department of Molecular Biology and Genetics, 215 Tower Rd, Cornell University, Ithaca, NY 14853, USA
- Jainy Thomas
  - Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
- Lindsay M Payer
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Jeffrey M Kidd
  - Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Julie Feusier
  - Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
- W Scott Watkins
  - Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
- Kathleen H Burns
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Lynn B Jorde
  - Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
- Cédric Feschotte
  - Department of Molecular Biology and Genetics, 215 Tower Rd, Cornell University, Ithaca, NY 14853, USA
10
Xue B, Zeng T, Jia L, Yang D, Lin SL, Sechi LA, Kelvin DJ. Identification of the distribution of human endogenous retroviruses K (HML-2) by PCR-based target enrichment sequencing. Retrovirology 2020; 17:10. [PMID: 32375827 PMCID: PMC7201656 DOI: 10.1186/s12977-020-00519-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/16/2020] [Accepted: 04/23/2020] [Indexed: 02/05/2023]
Abstract
BACKGROUND: Human endogenous retroviruses (HERVs), suspected to be transposition-defective, may reshape the transcriptional network of the human genome by regulatory elements distributed in their long terminal repeats (LTRs). HERV-K (HML-2), the most preserved group with the fewest accumulated mutations, has been associated with aberrant gene expression in tumorigenesis and autoimmune diseases. Because of the high sequence similarity between different HERV-Ks, current methods have limitations in providing genome-wide mapping specific for individual HERV-K (HML-2) members, a major barrier in delineating HERV-K (HML-2) function. RESULTS: In an attempt to obtain detailed distribution information of HERV-K (HML-2), we utilized a PCR-based target enrichment sequencing protocol for HERV-K (HML-2) (PTESHK) loci, which not only maps the presence of reference loci, but also identifies non-reference loci, enabling determination of the genome-wide distribution of HERV-K (HML-2) loci. Here we report on the genomic data obtained from three individuals. We identified a total of 978 loci using this method, including 30 new reference loci and 5 non-reference loci. Among the 3 individuals in our study, 14 polymorphic HERV-K (HML-2) loci were identified, and solo-LTR330 and N6p21.32 were identified as polymorphic for the first time. CONCLUSIONS: Interestingly, PTESHK provides an approach for the identification of the genome-wide distribution of HERV-K (HML-2) and can be used for the identification of polymorphic loci. Since polymorphic HERV-K (HML-2) integrations are suspected to be related to various diseases, PTESHK can supplement other emerging techniques in accessing polymorphic HERV-K (HML-2) elements in cancer and autoimmune diseases.
Affiliation(s)
- Bei Xue
  - Division of Immunology, Shantou University Medical College, Shantou, China
  - The Department of Microbiology and Immunology, Dalhousie University, Halifax, Canada
  - Canadian Center for Vaccinology, Dalhousie University, Halifax, Canada
- Tiansheng Zeng
  - Division of Immunology, Shantou University Medical College, Shantou, China
  - Department of Biomedical Sciences, University of Sassari, Sassari, Italy
- Lisha Jia
  - Division of Immunology, Shantou University Medical College, Shantou, China
- Dongsheng Yang
  - Division of Immunology, Shantou University Medical College, Shantou, China
- Stanley L Lin
  - Division of Immunology, Shantou University Medical College, Shantou, China
- Leonardo A Sechi
  - Department of Biomedical Sciences, University of Sassari, Sassari, Italy
- David J Kelvin
  - Division of Immunology, Shantou University Medical College, Shantou, China
  - The Department of Microbiology and Immunology, Dalhousie University, Halifax, Canada
  - Canadian Center for Vaccinology, Dalhousie University, Halifax, Canada
  - Department of Biomedical Sciences, University of Sassari, Sassari, Italy
|
11
|
Puurand T, Kukuškina V, Pajuste FD, Remm M. AluMine: alignment-free method for the discovery of polymorphic Alu element insertions. Mob DNA 2019; 10:31. [PMID: 31360240 PMCID: PMC6639938 DOI: 10.1186/s13100-019-0174-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/21/2019] [Accepted: 07/12/2019] [Indexed: 01/09/2023] Open
Abstract
Background: Recently, alignment-free sequence analysis methods have gained popularity in the field of personal genomics. These methods are based on counting frequencies of short k-mer sequences, thus allowing faster and more robust analysis compared to traditional alignment-based methods. Results: We have created a fast alignment-free method, AluMine, to analyze polymorphic insertions of Alu elements in the human genome. We tested the method on 2,241 individuals from the Estonian Genome Project and identified 28,962 potential polymorphic Alu element insertions. Each tested individual had on average 1,574 Alu element insertions that were different from those in the reference genome. In addition, we propose an alignment-free genotyping method that uses the frequency of insertion/deletion-specific 32-mer pairs to call the genotype directly from raw sequencing reads. Using this method, the concordance between the predicted and experimentally observed genotypes was 98.7%. The running time of the discovery pipeline is approximately 2 h per individual. The genotyping of potential polymorphic insertions takes between 0.4 and 4 h per individual, depending on the hardware configuration. Conclusions: AluMine provides tools that allow discovery of novel Alu element insertions and/or genotyping of known Alu element insertions from personal genomes within a few hours.
Affiliation(s)
- Tarmo Puurand
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Viktoria Kukuškina
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Maido Remm
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
12
|
Bourgeois Y, Boissinot S. On the Population Dynamics of Junk: A Review on the Population Genomics of Transposable Elements. Genes (Basel) 2019; 10:genes10060419. [PMID: 31151307 PMCID: PMC6627506 DOI: 10.3390/genes10060419] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 04/04/2019] [Revised: 05/05/2019] [Accepted: 05/21/2019] [Indexed: 01/18/2023] Open
Abstract
Transposable elements (TEs) play an important role in shaping genomic organization and structure, and may cause dramatic changes in phenotypes. Despite the genetic load they may impose on their host and their importance in microevolutionary processes such as adaptation and speciation, the number of population genetics studies focused on TEs has been rather limited so far compared to single nucleotide polymorphisms (SNPs). Here, we review the current knowledge about the dynamics of transposable elements at recent evolutionary time scales, and discuss the mechanisms that condition their abundance and frequency. We first discuss non-adaptive mechanisms such as purifying selection and the variable rates of transposition and elimination, and then focus on positive and balancing selection, to finally conclude on the potential role of TEs in causing genomic incompatibilities and eventually speciation. We also suggest possible ways to better model TE dynamics in a population genomics context by incorporating recent advances in TEs into the rich information provided by SNPs about the demography, selection, and intrinsic properties of genomes.
Affiliation(s)
- Yann Bourgeois
  - New York University Abu Dhabi, P.O. 129188, Saadiyat Island, Abu Dhabi, United Arab Emirates
- Stéphane Boissinot
  - New York University Abu Dhabi, P.O. 129188, Saadiyat Island, Abu Dhabi, United Arab Emirates
|
13
|
Su W, Gu X, Peterson T. TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome. Mol Plant 2019; 12:447-460. [PMID: 30802553 DOI: 10.1016/j.molp.2019.02.008] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 09/02/2018] [Revised: 02/19/2019] [Accepted: 02/19/2019] [Indexed: 05/21/2023]
Abstract
Transposable elements (TEs) make up a large and rapidly evolving proportion of plant genomes. Among Class II DNA TEs, TIR elements are flanked by characteristic terminal inverted repeat sequences (TIRs). TIR TEs may play important roles in genome evolution, including generating allelic diversity, inducing structural variation, and regulating gene expression. However, TIR TE identification and annotation have been hampered by the lack of effective tools, resulting in erroneous TE annotations and a significant underestimation of the proportion of TIR elements in the maize genome. This problem has largely limited our understanding of the impact of TIR elements on plant genome structure and evolution. In this paper, we propose a new method of TIR element detection and annotation. This new pipeline combines the advantages of current homology-based annotation methods with powerful de novo machine-learning approaches, resulting in greatly increased efficiency and accuracy of TIR element annotation. The results show that the copy number and genome proportion of TIR elements in maize are much larger than in current annotations. In addition, the distribution of some TIR superfamily elements is reduced in centromeric and pericentromeric positions, while others do not show a similar bias. Finally, the incorporation of machine-learning techniques has enabled the identification of large numbers of new DTA (hAT) family elements, which have all the hallmarks of bona fide TEs yet lack high homology with currently known DTA elements. Together, these results provide new tools for TE research and new insight into the impact of TIR elements on maize genome diversity.
Affiliation(s)
- Weijia Su
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA
- Xun Gu
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA
- Thomas Peterson
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011-3260, USA; Department of Agronomy, Iowa State University, Ames, IA 50011-3260, USA
|
14
|
Thomas J, Perron H, Feschotte C. Variation in proviral content among human genomes mediated by LTR recombination. Mob DNA 2018; 9:36. [PMID: 30568734 PMCID: PMC6298018 DOI: 10.1186/s13100-018-0142-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 09/17/2018] [Accepted: 11/29/2018] [Indexed: 01/23/2023] Open
Abstract
Background: Human endogenous retroviruses (HERVs) occupy a substantial fraction of the genome and impact cellular function with both beneficial and deleterious consequences. The vast majority of HERV sequences descend from ancient retroviral families no longer capable of infection or genomic propagation. In fact, most are no longer represented by full-length proviruses but by solitary long terminal repeats (solo LTRs) that arose via non-allelic recombination events between the two LTRs of a proviral insertion. Because LTR-LTR recombination events may occur long after proviral insertion but are challenging to detect in resequencing data, we hypothesize that this mechanism is a source of genomic variation in the human population that remains vastly underestimated. Results: We developed a computational pipeline specifically designed to capture dimorphic proviral/solo HERV allelic variants from short-read genome sequencing data. When applied to 279 individuals sequenced as part of the Simons Genome Diversity Project, the pipeline retrieves most of the dimorphic loci previously reported for the HERV-K(HML2) subfamily as well as dozens of additional candidates, including members of the HERV-H and HERV-W families previously involved in human development and disease. We experimentally validate several of these newly discovered dimorphisms, including the first reported instance of an unfixed HERV-W provirus and an HERV-H locus driving a transcript (ESRG) implicated in the maintenance of embryonic stem cell pluripotency. Conclusions: Our findings indicate that human proviral content exhibits more extensive interindividual variation than previously recognized, which has important bearings for deciphering the contribution of HERVs to human physiology and disease. Because LTR retroelements and LTR recombination are ubiquitous in eukaryotes, our computational pipeline should facilitate the mapping of this type of genomic variation for a wide range of organisms.
Electronic supplementary material: The online version of this article (10.1186/s13100-018-0142-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jainy Thomas
  - Department of Human Genetics, University of Utah School of Medicine, 15 North 2030 East, Rm 5100, Salt Lake City, UT 84112, USA
- Hervé Perron
  - GeNeuro, Plan-les-Ouates, Geneva, Switzerland
  - Université Claude Bernard, Lyon, France
- Cédric Feschotte
  - Department of Molecular Biology and Genetics, Cornell University, 107 Biotechnology Building, Ithaca, NY 14853, USA
|
15
|
Wang L, Jordan IK. Transposable element activity, genome regulation and human health. Curr Opin Genet Dev 2018; 49:25-33. [PMID: 29505964 DOI: 10.1016/j.gde.2018.02.006] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/11/2017] [Revised: 01/30/2018] [Accepted: 02/13/2018] [Indexed: 12/21/2022]
Abstract
A convergence of novel genome analysis technologies is enabling population genomic studies of human transposable elements (TEs). Population surveys of human genome sequences have uncovered thousands of individual TE insertions that segregate as common genetic variants, i.e. TE polymorphisms. These recent TE insertions provide an important source of naturally occurring human genetic variation. Investigators are beginning to leverage population genomic data sets to execute genome-scale association studies for assessing the phenotypic impact of human TE polymorphisms. For example, the expression quantitative trait loci (eQTL) analytical paradigm has recently been used to uncover hundreds of associations between human TE insertion variants and gene expression levels. These include population-specific gene regulatory effects as well as coordinated changes to gene regulatory networks. In addition, analyses of linkage disequilibrium patterns with previously characterized genome-wide association study (GWAS) trait variants have uncovered TE insertion polymorphisms that are likely causal variants for a variety of common complex diseases. Gene regulatory mechanisms that underlie specific disease phenotypes have been proposed for a number of these trait associated TE polymorphisms. These new population genomic approaches hold great promise for understanding how ongoing TE activity contributes to functionally relevant genetic variation within and between human populations.
Affiliation(s)
- Lu Wang
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA; PanAmerican Bioinformatics Institute, Cali, Colombia
- I King Jordan
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA; PanAmerican Bioinformatics Institute, Cali, Colombia
|