1
Shukla HG, Chakraborty M, Emerson J. Genetic variation in recalcitrant repetitive regions of the Drosophila melanogaster genome. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.11.598575. [PMID: 38915508 PMCID: PMC11195212 DOI: 10.1101/2024.06.11.598575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Many essential functions of organisms are encoded in highly repetitive genomic regions, including histones involved in DNA packaging, centromeres that are core components of chromosome segregation, ribosomal RNA comprising the protein translation machinery, telomeres that ensure chromosome integrity, piRNA clusters encoding host defenses against selfish elements, and virtually the entire Y chromosome. These regions, formed by highly similar tandem arrays, pose significant challenges for experimental and informatic study, impeding sequence-level descriptions essential for understanding genetic variation. Here, we report the assembly and variation analysis of such repetitive regions in Drosophila melanogaster, offering significant improvements to the existing community reference assembly. Our work successfully recovers previously elusive segments, including complete reconstructions of the histone locus and the pericentric heterochromatin of the X chromosome, spanning the Stellate locus to the distal flank of the rDNA cluster. To infer structural changes in these regions where alignments are often not practicable, we introduce landmark anchors based on unique variants that are putatively orthologous. These regions display considerable structural variation between different D. melanogaster strains, exhibiting differences in copy number and organization of homologous repeat units between haplotypes. In the histone cluster, although we observe minimal genetic exchange indicative of crossing over, the variation patterns suggest mechanisms such as unequal sister chromatid exchange. We also examine the prevalence and scale of concerted evolution in the histone and Stellate clusters and discuss the mechanisms underlying these observed patterns.
Affiliation(s)
- Harsh G. Shukla
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, California 92697, USA
  - Graduate Program in Mathematical, Computational and Systems Biology, University of California Irvine, Irvine, California 92697, USA
- Mahul Chakraborty
  - Department of Biology, Texas A&M University, College Station, Texas 77843, USA
- J.J. Emerson
  - Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, California 92697, USA
  - Center for Complex Biological Systems, University of California Irvine, Irvine, California 92697, USA
2
Arends T, Tsuchida H, Adeyemi RO, Tapscott SJ. DUX4-induced HSATII transcription causes KDM2A/B-PRC1 nuclear foci and impairs DNA damage response. J Cell Biol 2024; 223:e202303141. [PMID: 38451221 PMCID: PMC10919155 DOI: 10.1083/jcb.202303141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Revised: 11/02/2023] [Accepted: 02/01/2024] [Indexed: 03/08/2024] Open
Abstract
Polycomb repressive complexes regulate developmental gene programs, promote DNA damage repair, and mediate pericentromeric satellite repeat repression. Expression of pericentromeric satellite repeats has been implicated in several cancers and diseases, including facioscapulohumeral dystrophy (FSHD). Here, we show that DUX4-mediated transcription of HSATII regions causes nuclear foci formation of KDM2A/B-PRC1 complexes, resulting in a global loss of PRC1-mediated monoubiquitination of histone H2A. Loss of PRC1-ubiquitin signaling severely impacts DNA damage response. Our data implicate DUX4-activation of HSATII and sequestration of KDM2A/B-PRC1 complexes as a mechanism of regulating epigenetic and DNA repair pathways.
Affiliation(s)
- Tessa Arends
  - Human Biology Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Hiroshi Tsuchida
  - Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Richard O. Adeyemi
  - Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Stephen J. Tapscott
  - Human Biology Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
  - Clinical Research Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
  - Department of Neurology, University of Washington, Seattle, WA, USA
3
Westemeier-Rice ES, Winters MT, Rawson TW, Martinez I. More than the SRY: The Non-Coding Landscape of the Y Chromosome and Its Importance in Human Disease. Noncoding RNA 2024; 10:21. [PMID: 38668379 PMCID: PMC11054740 DOI: 10.3390/ncrna10020021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Revised: 03/31/2024] [Accepted: 04/08/2024] [Indexed: 04/29/2024] Open
Abstract
Historically, the Y chromosome has presented challenges to classical methodology and philosophy of understanding the differences between males and females. An unsolved genetic puzzle, the Y chromosome was the last chromosome to be fully sequenced. With the advent of the Human Genome Project came a realization that the human genome is more than just genes encoding proteins, and an entire universe of RNA was discovered. This dark matter of biology and the black box surrounding the Y chromosome have collided over the last few years, as increasing numbers of non-coding RNAs have been identified across the length of the Y chromosome, many of which have played significant roles in disease. In this review, we will uncover what is known about the connections between the Y chromosome and the non-coding RNA universe that originates from it, particularly as it relates to long non-coding RNAs, microRNAs and circular RNAs.
Affiliation(s)
- Emily S. Westemeier-Rice
  - West Virginia University Cancer Institute, West Virginia University School of Medicine, Morgantown, WV 26506, USA
- Michael T. Winters
  - Department of Microbiology, Immunology and Cell Biology, West Virginia University School of Medicine, Morgantown, WV 26506, USA
- Travis W. Rawson
  - Department of Microbiology, Immunology and Cell Biology, West Virginia University School of Medicine, Morgantown, WV 26506, USA
- Ivan Martinez
  - West Virginia University Cancer Institute, West Virginia University School of Medicine, Morgantown, WV 26506, USA
  - Department of Microbiology, Immunology and Cell Biology, West Virginia University School of Medicine, Morgantown, WV 26506, USA
4
Ruzanov P, Evdokimova V, Pachva MC, Minkovich A, Zhang Z, Langman S, Gassmann H, Thiel U, Orlic-Milacic M, Zaidi SH, Peltekova V, Heisler LE, Sharma M, Cox ME, McKee TD, Zaidi M, Lapouble E, McPherson JD, Delattre O, Radvanyi L, Burdach SE, Stein LD, Sorensen PH. Oncogenic ETS fusions promote DNA damage and proinflammatory responses via pericentromeric RNAs in extracellular vesicles. J Clin Invest 2024; 134:e169470. [PMID: 38530366 PMCID: PMC11060741 DOI: 10.1172/jci169470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 03/12/2024] [Indexed: 03/28/2024] Open
Abstract
Aberrant expression of the E26 transformation-specific (ETS) transcription factors characterizes numerous human malignancies. Many of these proteins, including EWS:FLI1 and EWS:ERG fusions in Ewing sarcoma (EwS) and TMPRSS2:ERG in prostate cancer (PCa), drive oncogenic programs via binding to GGAA repeats. We report here that both EWS:FLI1 and ERG bind and transcriptionally activate GGAA-rich pericentromeric heterochromatin. The respective pathogen-like HSAT2 and HSAT3 RNAs, together with LINE, SINE, ERV, and other repeat transcripts, are expressed in EwS and PCa tumors, secreted in extracellular vesicles (EVs), and are highly elevated in plasma of patients with EwS with metastatic disease. High human satellite 2 and 3 (HSAT2,3) levels in EWS:FLI1- or ERG-expressing cells and tumors were associated with induction of G2/M checkpoint, mitotic spindle, and DNA damage programs. These programs were also activated in EwS EV-treated fibroblasts, coincident with accumulation of HSAT2,3 RNAs, proinflammatory responses, mitotic defects, and senescence. Mechanistically, HSAT2,3-enriched cancer EVs induced cGAS-TBK1 innate immune signaling and formation of cytosolic granules positive for double-strand RNAs, RNA-DNA, and cGAS. Hence, aberrantly expressed ETS proteins derepress pericentromeric heterochromatin, yielding pathogenic RNAs that transmit genotoxic stress and inflammation to local and distant sites. Monitoring HSAT2,3 plasma levels and preventing their dissemination may thus improve therapeutic strategies and blood-based diagnostics.
Affiliation(s)
- Peter Ruzanov
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Manideep C. Pachva
  - Department of Molecular Oncology, British Columbia Cancer Research Centre and
  - Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Alon Minkovich
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Zhenbo Zhang
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Sofya Langman
  - Department of Molecular Oncology, British Columbia Cancer Research Centre and
  - Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Hendrik Gassmann
  - Department of Pediatrics, Children’s Cancer Research Center, Kinderklinik München Schwabing, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany
- Uwe Thiel
  - Department of Pediatrics, Children’s Cancer Research Center, Kinderklinik München Schwabing, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany
- Syed H. Zaidi
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Vanya Peltekova
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Manju Sharma
  - Vancouver Prostate Centre, Vancouver, British Columbia, Canada
- Michael E. Cox
  - Vancouver Prostate Centre, Vancouver, British Columbia, Canada
- Trevor D. McKee
  - STTARR Innovation Centre, Radiation Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  - Pathomics Inc., Toronto, Ontario, Canada
- Mark Zaidi
  - Pathomics Inc., Toronto, Ontario, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Eve Lapouble
  - Unité Génétique Somatique (UGS), Institut Curie, Centre Hospitalier Paris, France
- John D. McPherson
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Biochemistry and Molecular Medicine, University of California Davis Comprehensive Cancer Center, Sacramento, California, USA
- Olivier Delattre
  - Unité Génétique Somatique (UGS), Institut Curie, Centre Hospitalier Paris, France
  - Diversity and Plasticity of Childhood tumors, INSERM U830, Institut Curie Research Center, PSL Research University, Paris, France
- Laszlo Radvanyi
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Immunology, University of Toronto, Toronto, Ontario, Canada
- Stefan E.G. Burdach
  - Department of Molecular Oncology, British Columbia Cancer Research Centre and
  - Department of Pediatrics, Children’s Cancer Research Center, Kinderklinik München Schwabing, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany
  - CCC München Comprehensive Cancer Center, DKTK German Cancer Consortium, Munich, Germany
  - Institute of Pathology, Translation Pediatric Cancer Research Action, School of Medicine, Technical University of Munich, Munich, Germany
- Lincoln D. Stein
  - Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Poul H. Sorensen
  - Department of Molecular Oncology, British Columbia Cancer Research Centre and
  - Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, British Columbia, Canada
5
Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles. Nucleic Acids Res 2024; 52:2212-2230. [PMID: 38364871 PMCID: PMC10954445 DOI: 10.1093/nar/gkae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 01/18/2024] [Accepted: 01/27/2024] [Indexed: 02/18/2024] Open
Abstract
Nonreference sequences (NRSs) are DNA sequences present in global populations but absent in the current human reference genome. However, the extent and functional significance of NRSs in the human genomes and populations remain unclear. Here, we de novo assembled 539 genomes from five genetically divergent human populations using long-read sequencing technology, resulting in the identification of 5.1 million NRSs. These were merged into 45,284 unique NRSs, with 29.7% being novel discoveries. Among these NRSs, 38.7% were common across the five populations, and 35.6% were population specific. The use of a graph-based pangenome approach allowed for the detection of 565 transcript expression quantitative trait loci on NRSs, with 426 of these being novel findings. Moreover, 26 NRS candidates displayed evidence of adaptive selection within human populations. Genes situated in close proximity to or intersecting with these candidates may be associated with metabolism and type 2 diabetes. Genome-wide association studies revealed 14 NRSs to be significantly associated with eight phenotypes. Additionally, 154 NRSs were found to be in strong linkage disequilibrium with 258 phenotype-associated SNPs in the GWAS catalogue. Our work expands the understanding of human NRSs and provides novel insights into their functions, facilitating evolutionary and biomedical research.
Affiliation(s)
- Zhikun Wu
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Tong Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Zehang Jiang
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Jingjing Zheng
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Yizhou Gu
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
  - University of Wisconsin-Madison, WI, USA
- Yizhi Liu
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Yun Liu
  - MOE Key Laboratory of Metabolism and Molecular Medicine, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences and Shanghai Xuhui Central Hospital, Fudan University, Shanghai, China
- Zhi Xie
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
  - Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
6
Annapragada AV, Niknafs N, White JR, Bruhm DC, Cherry C, Medina JE, Adleff V, Hruban C, Mathios D, Foda ZH, Phallen J, Scharpf RB, Velculescu VE. Genome-wide repeat landscapes in cancer and cell-free DNA. Sci Transl Med 2024; 16:eadj9283. [PMID: 38478628 DOI: 10.1126/scitranslmed.adj9283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 02/16/2024] [Indexed: 03/22/2024]
Abstract
Genetic changes in repetitive sequences are a hallmark of cancer and other diseases, but characterizing these has been challenging using standard sequencing approaches. We developed a de novo kmer finding approach, called ARTEMIS (Analysis of RepeaT EleMents in dISease), to identify repeat elements from whole-genome sequencing. Using this method, we analyzed 1.2 billion kmers in 2837 tissue and plasma samples from 1975 patients, including those with lung, breast, colorectal, ovarian, liver, gastric, head and neck, bladder, cervical, thyroid, or prostate cancer. We identified tumor-specific changes in these patients in 1280 repeat element types from the LINE, SINE, LTR, transposable element, and human satellite families. These included changes to known repeats and 820 elements that were not previously known to be altered in human cancer. Repeat elements were enriched in regions of driver genes, and their representation was altered by structural changes and epigenetic states. Machine learning analyses of genome-wide repeat landscapes and fragmentation profiles in cfDNA detected patients with early-stage lung or liver cancer in cross-validated and externally validated cohorts. In addition, these repeat landscapes could be used to noninvasively identify the tissue of origin of tumors. These analyses reveal widespread changes in repeat landscapes of human cancers and provide an approach for their detection and characterization that could benefit early detection and disease monitoring of patients with cancer.
Affiliation(s)
- Akshaya V Annapragada
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Noushin Niknafs
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- James R White
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Daniel C Bruhm
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Christopher Cherry
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Jamie E Medina
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Vilmos Adleff
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Carolyn Hruban
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Dimitrios Mathios
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Zachariah H Foda
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
  - Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Jillian Phallen
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Robert B Scharpf
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Victor E Velculescu
  - Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
  - Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
7
Chrisman B, He C, Jung JY, Stockham N, Paskov K, Washington P, Petereit J, Wall DP. Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity. Genome Res 2023; 33:1734-1746. [PMID: 37879860 PMCID: PMC10691534 DOI: 10.1101/gr.277175.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 05/25/2023] [Indexed: 10/27/2023]
Abstract
Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
Affiliation(s)
- Brianna Chrisman
  - Department of Bioengineering, Stanford University, Stanford, California 94305, USA
  - Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA
- Chloe He
  - Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
- Jae-Yoon Jung
  - Department of Pediatrics (Systems Medicine), Stanford University, Stanford, California 94305, USA
- Nate Stockham
  - Department of Neuroscience, Stanford University, Stanford, California 94305, USA
- Kelley Paskov
  - Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
- Peter Washington
  - Department of Bioengineering, Stanford University, Stanford, California 94305, USA
- Juli Petereit
  - Nevada Bioinformatics Center, University of Nevada, Reno, Nevada 89557, USA
- Dennis P Wall
  - Department of Biomedical Data Science, Stanford University, Stanford, California 94305, USA
  - Department of Pediatrics (Systems Medicine), Stanford University, Stanford, California 94305, USA
8
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 56] [Impact Index Per Article: 56.0] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Oxford Nanopore Technologies Inc., Oxford, UK
- Monika Cechova
  - Faculty of Informatics, Masaryk University, Brno, Czech Republic
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- Savannah J Hoyt
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Dylan J Taylor
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Nicolas Altemose
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
- Paul W Hook
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Mikko Rautiainen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ivan A Alexandrov
  - Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
- Jamie Allen
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Andrey V Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
- Nae-Chyun Chen
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Chen-Shan Chin
  - GeneDX Holdings Corp, Stamford, CT, USA
  - Foundation of Biological Data Science, Belmont, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Carlos Garcia Giron
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Ariel Gershman
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jennifer L Gerton
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - University of Kansas Medical Center, Kansas City, MO, USA
- Patrick G S Grady
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Genomics Research Centre, Human Technopole, Milan, Italy
- Leanne Haggerty
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Reza Halabian
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Nancy F Hansen
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  - Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Robert Harris
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
- Gabrielle A Hartley
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Jakob Heinz
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Sarah E Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, MD, USA
- Miten Jain
  - Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
- Rupesh K Kesharwani
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Alexandra P Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Heng Li
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Glennis A Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Julian K Lucas
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Wojciech Makalowski
  - Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
- Christopher Markovic
  - Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ann M Mc Cartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Rajiv C McCoy
  - Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Jennifer McDaniel
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Brandy M McNulty
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
  - Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
  - UCL Queen Square Institute of Neurology, UCL, London, UK
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Terence D Murphy
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hugh E Olsen
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Nathan D Olson
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | - Steven L Salzberg
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | | | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | | | - Likhitha Surapaneni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Angela M Taravella Oill
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison C Watwood
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | | | | | - Melissa A Wilson
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
|
9
|
Hallast P, Ebert P, Loftus M, Yilmaz F, Audano PA, Logsdon GA, Bonder MJ, Zhou W, Höps W, Kim K, Li C, Hoyt SJ, Dishuck PC, Porubsky D, Tsetsos F, Kwon JY, Zhu Q, Munson KM, Hasenfeld P, Harvey WT, Lewis AP, Kordosky J, Hoekzema K, O'Neill RJ, Korbel JO, Tyler-Smith C, Eichler EE, Shi X, Beck CR, Marschall T, Konkel MK, Lee C. Assembly of 43 human Y chromosomes reveals extensive complexity and variation. Nature 2023; 621:355-364. [PMID: 37612510 PMCID: PMC10726138 DOI: 10.1038/s41586-023-06425-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 11/30/2022] [Accepted: 07/11/2023] [Indexed: 08/25/2023]
Abstract
The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date [1] and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes [2]. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established [1] boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.
Affiliation(s)
- Pille Hallast
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Mark Loftus
- Department of Genetics & Biochemistry, Clemson University, Clemson, SC, USA
- Center for Human Genetics, Clemson University, Greenwood, SC, USA
| | - Feyza Yilmaz
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Peter A Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marc Jan Bonder
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| | - Weichen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Wolfram Höps
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | - Kwondo Kim
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Chong Li
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Savannah J Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Philip C Dishuck
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Fotios Tsetsos
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Jee Young Kwon
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Qihui Zhu
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Patrick Hasenfeld
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Jennifer Kordosky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- The University of Connecticut Health Center, Farmington, CT, USA
| | - Jan O Korbel
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | | | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Xinghua Shi
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Christine R Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- The University of Connecticut Health Center, Farmington, CT, USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany
| | - Miriam K Konkel
- Department of Genetics & Biochemistry, Clemson University, Clemson, SC, USA
- Center for Human Genetics, Clemson University, Greenwood, SC, USA
| | - Charles Lee
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA.
|
10
|
Ponomartsev N, Zilov D, Gushcha E, Travina A, Sergeev A, Enukashvily N. Overexpression of Pericentromeric HSAT2 DNA Increases Expression of EMT Markers in Human Epithelial Cancer Cell Lines. Int J Mol Sci 2023; 24:ijms24086918. [PMID: 37108080 PMCID: PMC10138405 DOI: 10.3390/ijms24086918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Revised: 04/02/2023] [Accepted: 04/04/2023] [Indexed: 04/29/2023] Open
Abstract
Pericentromeric tandemly repeated DNA of human satellites 1, 2, and 3 (HS1, HS2, and HS3) is actively transcribed in some cells. However, the function of this transcription remains obscure. Studies in this area have been hampered by the absence of a gapless genome assembly. The aim of our study was to map a transcript that we have previously described as HS2/HS3 on chromosomes using the newly published gapless genome assembly T2T-CHM13, and to create a plasmid overexpressing the transcript to assess the influence of HS2/HS3 transcription on cancer cells. We report here that the sequence of the transcript is tandemly repeated on nine chromosomes (1, 2, 7, 9, 10, 16, 17, 22, and Y). A detailed analysis of its genomic localization and annotation in the T2T-CHM13 assembly revealed that the sequence belonged to HSAT2 (HS2) but not to the HS3 family of tandemly repeated DNA. The transcript was found on both strands of HSAT2 arrays. The overexpression of the HSAT2 transcript increased the transcription of the genes encoding the proteins involved in the epithelial-to-mesenchymal transition, EMT (SNAI1, ZEB1, and SNAI2), and the genes that mark cancer-associated fibroblasts (VIM, COL1A1, COL11A1, and ACTA2) in cancer cell lines A549 and HeLa. Co-transfection of the overexpression plasmid and antisense oligonucleotides eliminated the transcription of EMT genes observed after HSAT2 overexpression. Antisense oligonucleotides also decreased transcription of the EMT genes induced by transforming growth factor beta 1 (TGFβ1). Thus, our study suggests that HSAT2 lncRNA transcribed from the pericentromeric tandemly repeated DNA is involved in EMT regulation in cancer cells.
Affiliation(s)
- Nikita Ponomartsev
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
| | - Danil Zilov
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
- Applied Genomics Laboratory, SCAMT Institute, ITMO University, Saint Petersburg 191002, Russia
| | - Ekaterina Gushcha
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
| | - Alexandra Travina
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
| | - Alexander Sergeev
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
| | - Natella Enukashvily
- Institute of Cytology, Russian Academy of Sciences, St. Petersburg 194064, Russia
|
11
|
Lopes M, Louzada S, Ferreira D, Veríssimo G, Eleutério D, Gama-Carvalho M, Chaves R. Human Satellite 1A analysis provides evidence of pericentromeric transcription. BMC Biol 2023; 21:28. [PMID: 36755311 PMCID: PMC9909926 DOI: 10.1186/s12915-023-01521-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/14/2022] [Accepted: 01/19/2023] [Indexed: 02/10/2023] Open
Abstract
BACKGROUND: Pericentromeric regions of human chromosomes are composed of tandemly repeated and highly organized sequences named satellite DNAs. Human classical satellite DNAs are classified into three families named HSat1, HSat2, and HSat3, which have historically posed a challenge for the assembly of the human reference genome, where they are misrepresented due to their repetitive nature. Although long known as the most AT-rich fraction of the human genome, classical satellite HSat1A has been disregarded in genomic and transcriptional studies, falling behind other human satellites in terms of functional knowledge. Here, we aim to characterize HSat1A and provide an understanding of its biological relevance. RESULTS: The work began with HSat1A isolation and cloning, followed by in silico analysis. Monomer copy number and expression data were obtained in a wide variety of human cell lines, with greatly varying profiles in tumoral/non-tumoral samples. HSat1A was mapped in human chromosomes and applied in in situ transcriptional assays. Additionally, it was possible to observe the nuclear organization of HSat1A transcripts and further characterize them by 3' RACE-Seq. Size-varying polyadenylated HSat1A transcripts were detected, which possibly accounts for the intricate regulation of alternative polyadenylation. CONCLUSION: As far as we know, this work pioneers HSat1A transcription studies. With the emergence of new human genome assemblies, acrocentric pericentromeres are becoming relevant characters in disease and other biological contexts. HSat1A sequences and associated noncoding RNAs will most certainly prove significant in the future of HSat research.
Affiliation(s)
- Mariana Lopes
- CytoGenomics Lab, Department of Genetics and Biotechnology (DGB), University of Trás-Os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Sandra Louzada
- CytoGenomics Lab, Department of Genetics and Biotechnology (DGB), University of Trás-Os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Daniela Ferreira
- CytoGenomics Lab, Department of Genetics and Biotechnology (DGB), University of Trás-Os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Gabriela Veríssimo
- CytoGenomics Lab, Department of Genetics and Biotechnology (DGB), University of Trás-Os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Daniel Eleutério
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Margarida Gama-Carvalho
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal
| | - Raquel Chaves
- CytoGenomics Lab, Department of Genetics and Biotechnology (DGB), University of Trás-Os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal
- BioISI – Biosystems & Integrative Sciences Institute, Faculty of Sciences, University of Lisboa, 1749-016 Lisbon, Portugal.
|
12
|
Mirceta M, Shum N, Schmidt MHM, Pearson CE. Fragile sites, chromosomal lesions, tandem repeats, and disease. Front Genet 2022; 13:985975. [PMID: 36468036 PMCID: PMC9714581 DOI: 10.3389/fgene.2022.985975] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/04/2022] [Accepted: 09/02/2022] [Indexed: 09/16/2023] Open
Abstract
Expanded tandem repeat DNAs are associated with various unusual chromosomal lesions, despiralizations, multi-branched inter-chromosomal associations, and fragile sites. Fragile sites cytogenetically manifest as localized gaps or discontinuities in chromosome structure and are an important genetic, biological, and health-related phenomenon. Common fragile sites (∼230), present in most individuals, are induced by aphidicolin and can be associated with cancer; of the 27 molecularly-mapped common sites, none are associated with a particular DNA sequence motif. Rare fragile sites (≳40 known), present in ≤5% of the population (possibly in as few as a single individual), can be associated with neurodevelopmental disease. All 10 molecularly-mapped folate-sensitive fragile sites, the largest category of rare fragile sites, are caused by gene-specific CGG/CCG tandem repeat expansions that are aberrantly CpG methylated and include FRAXA, FRAXE, FRAXF, FRA2A, FRA7A, FRA10A, FRA11A, FRA11B, FRA12A, and FRA16A. The minisatellite-associated rare fragile sites, FRA10B and FRA16B, can be induced by AT-rich DNA-ligands or nucleotide analogs. Despiralized lesions and multi-branched inter-chromosomal associations at the heterochromatic satellite repeats of chromosomes 1, 9, and 16 are inducible by de-methylating agents like 5-azadeoxycytidine and can spontaneously arise in patients with ICF syndrome (Immunodeficiency, Centromeric instability and Facial anomalies) with mutations in genes regulating DNA methylation. ICF individuals have hypomethylated satellites I-III, alpha-satellites, and subtelomeric repeats. Ribosomal repeats and subtelomeric D4Z4 megasatellites/macrosatellites are associated with chromosome location, fragility, and disease. Telomere repeats can also assume fragile sites. Dietary deficiencies of folate or vitamin B12, or drug insults, are associated with megaloblastic and/or pernicious anemia, which display chromosomes with fragile sites.
The recent discovery of many new tandem repeat expansion loci, with varied repeat motifs, where motif lengths can range from mono-nucleotides to megabase units, could be the molecular cause of new fragile sites, or other chromosomal lesions. This review focuses on repeat-associated fragility, covering their induction, cytogenetics, epigenetics, cell type specificity, genetic instability (repeat instability, micronuclei, deletions/rearrangements, and sister chromatid exchange), unusual heritability, disease association, and penetrance. Understanding tandem repeat-associated chromosomal fragile sites provides insight to chromosome structure, genome packaging, genetic instability, and disease.
Affiliation(s)
- Mila Mirceta
- Program of Genetics and Genome Biology, The Hospital for Sick Children, The Peter Gilgan Centre for Research and Learning, Toronto, ON, Canada
- Program of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Natalie Shum
- Program of Genetics and Genome Biology, The Hospital for Sick Children, The Peter Gilgan Centre for Research and Learning, Toronto, ON, Canada
- Program of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Monika H. M. Schmidt
- Program of Genetics and Genome Biology, The Hospital for Sick Children, The Peter Gilgan Centre for Research and Learning, Toronto, ON, Canada
- Program of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Christopher E. Pearson
- Program of Genetics and Genome Biology, The Hospital for Sick Children, The Peter Gilgan Centre for Research and Learning, Toronto, ON, Canada
- Program of Molecular Genetics, University of Toronto, Toronto, ON, Canada
|
13
|
Liddiard K, Aston-Evans AN, Cleal K, Hendrickson E, Baird D. POLQ suppresses genome instability and alterations in DNA repeat tract lengths. NAR Cancer 2022; 4:zcac020. [PMID: 35774233 PMCID: PMC9241439 DOI: 10.1093/narcan/zcac020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Revised: 05/19/2022] [Accepted: 06/10/2022] [Indexed: 11/26/2022] Open
Abstract
DNA polymerase theta (POLQ) is a principal component of the alternative non-homologous end-joining (ANHEJ) DNA repair pathway that ligates DNA double-strand breaks. Utilizing independent models of POLQ insufficiency during telomere-driven crisis, we found that POLQ-/- cells are resistant to crisis-induced growth deceleration despite sustaining inter-chromosomal telomere fusion frequencies equivalent to wild-type (WT) cells. We recorded longer telomeres in POLQ-/- than WT cells pre- and post-crisis, notwithstanding elevated total telomere erosion and fusion rates. POLQ-/- cells emerging from crisis exhibited reduced incidence of clonal gross chromosomal abnormalities in accordance with increased genetic heterogeneity. High-throughput sequencing of telomere fusion amplicons from POLQ-deficient cells revealed significantly raised frequencies of inter-chromosomal fusions with correspondingly depreciated intra-chromosomal recombinations. Long-range interactions culminating in telomere fusions with centromere alpha-satellite repeats, as well as expansions in HSAT2 and HSAT3 satellite and contractions in ribosomal DNA repeats, were detected in POLQ-/- cells. In conjunction with the expanded telomere lengths of POLQ-/- cells, these results indicate a hitherto unrealized capacity of POLQ for regulation of repeat arrays within the genome. Our findings uncover novel considerations for the efficacy of POLQ inhibitors in clinical cancer interventions, where potential genome destabilizing consequences could drive clonal evolution and resistant disease.
Affiliation(s)
- Kate Liddiard
- Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| | - Alys N Aston-Evans
- Dementia Research Institute, School of Medicine, Cardiff University, Hadyn Ellis Building, Maindy Road, Cardiff CF24 4HQ, UK
| | - Kez Cleal
- Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
| | - Eric A Hendrickson
- Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota Medical School, Minneapolis, MN 55455, USA
| | - Duncan M Baird
- Division of Cancer and Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK
|
14
|
Cechova M, Miga KH. Satellite DNAs and human sex chromosome variation. Semin Cell Dev Biol 2022; 128:15-25. [PMID: 35644878 DOI: 10.1016/j.semcdb.2022.04.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 04/26/2022] [Accepted: 04/27/2022] [Indexed: 11/17/2022]
Abstract
Satellite DNAs are present on every chromosome in the cell and are typically enriched in repetitive, heterochromatic parts of the human genome. Sex chromosomes represent a unique genomic and epigenetic context. In this review, we first report what is known about satellite DNA biology on human X and Y chromosomes, including repeat content and organization, as well as satellite variation in typical euploid individuals. Then, we review sex chromosome aneuploidies that are among the most common types of aneuploidies in the general population, and are better tolerated than autosomal aneuploidies. This is demonstrated also by the fact that aging is associated with the loss of the X, and especially the Y chromosome. In addition, supernumerary sex chromosomes enable us to study general processes in a cell, such as analyzing heterochromatin dosage (i.e. additional Barr bodies and long heterochromatin arrays on Yq) and their downstream consequences. Finally, genomic and epigenetic organization and regulation of satellite DNA could influence chromosome stability and lead to aneuploidy. In this review, we argue that the complete annotation of satellite DNA on sex chromosomes in human, and especially in centromeric regions, will aid in explaining the prevalence and the consequences of sex chromosome aneuploidies.
Affiliation(s)
- Monika Cechova
- Faculty of Informatics, Masaryk University, Czech Republic
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, CA, USA; UC Santa Cruz Genomics Institute, University of California Santa Cruz, CA 95064, USA
|
15
|
A classical revival: Human satellite DNAs enter the genomics era. Semin Cell Dev Biol 2022; 128:2-14. [PMID: 35487859 DOI: 10.1016/j.semcdb.2022.04.012] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/21/2022] [Revised: 04/11/2022] [Accepted: 04/12/2022] [Indexed: 12/30/2022]
Abstract
The classical human satellite DNAs, also referred to as human satellites 1, 2 and 3 (HSat1, HSat2, HSat3, or collectively HSat1-3), occur on most human chromosomes as large, pericentromeric tandem repeat arrays, which together constitute roughly 3% of the human genome (100 megabases, on average). Even though HSat1-3 were among the first human DNA sequences to be isolated and characterized at the dawn of molecular biology, they have remained almost entirely missing from the human genome reference assembly for 20 years, hindering studies of their sequence, regulation, and potential structural roles in the nucleus. Recently, the Telomere-to-Telomere Consortium produced the first truly complete assembly of a human genome, paving the way for new studies of HSat1-3 with modern genomic tools. This review provides an account of the history and current understanding of HSat1-3, with a view towards future studies of their evolution and roles in health and disease.
|
16
|
Yandım C, Karakülah G. Repeat expression is linked to patient survival and exhibits single nucleotide variation in pancreatic cancer revealing LTR70:r.879A>G. Gene X 2022; 822:146344. [PMID: 35183687 DOI: 10.1016/j.gene.2022.146344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2021] [Revised: 02/03/2022] [Accepted: 02/14/2022] [Indexed: 11/04/2022] Open
Abstract
Despite an overwhelming amount of cancer literature reporting links between patient survival and the expression levels of genes or the mutations/single nucleotide variations (SNVs) on them, there is only limited information on repeat elements, which make up at least half of the human genome. Here, we analysed RNA-seq data obtained from primary pancreatic cancer tissues of 51 patients and revealed that two transposons, HERVI-int and X6A_LINE, showed an upregulation trend in the patients with shorter survival, along with 56 other potential repeats that were linked to survival. We also detected expressed SNVs on repeats, among which LTR70:r.879A>G stands out for the effect of its presence on this particular repeat's expression levels and for a significant link to overall patient survival. Interestingly, the expression of LTR70:r.879A>G correlated with different cancer genes in comparison to its reference version, highlighting the involvement of BRAF and fumarate hydratase with this expressed SNV. This is one of the first studies revealing possible links between repeat expression and survival in cancer, and it warrants further research in this avenue.
Affiliation(s)
- Cihangir Yandım
- İzmir University of Economics, Faculty of Engineering, Department of Genetics and Bioengineering, 35330 Balçova, İzmir, Turkey; İzmir Biomedicine and Genome Center (IBG), Dokuz Eylül University Health Campus, 35340 İnciraltı, İzmir, Turkey
| | - Gökhan Karakülah
- İzmir Biomedicine and Genome Center (IBG), Dokuz Eylül University Health Campus, 35340 İnciraltı, İzmir, Turkey; İzmir International Biomedicine and Genome Institute, Dokuz Eylül University, 35340 İnciraltı, İzmir, Turkey.
| |
Collapse
|
17
|
Saayman X, Esashi F. Breaking the paradigm: early insights from mammalian DNA breakomes. FEBS J 2022; 289:2409-2428. [PMID: 33792193 PMCID: PMC9451923 DOI: 10.1111/febs.15849] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/30/2020] [Revised: 03/04/2021] [Accepted: 03/29/2021] [Indexed: 12/13/2022]
Abstract
DNA double-strand breaks (DSBs) can result from both exogenous and endogenous sources and are potentially toxic lesions to the human genome. If improperly repaired, DSBs can threaten genome integrity and contribute to premature ageing, neurodegenerative disorders and carcinogenesis. Through decades of work on genome stability, it has become evident that certain regions of the genome are inherently more prone to breakage than others, known as genome instability hotspots. Recent advancements in sequencing-based technologies now enable the profiling of genome-wide distributions of DSBs, also known as breakomes, to systematically map these instability hotspots. Here, we review the application of these technologies and their implications for our current understanding of the genomic regions most likely to drive genome instability. These breakomes ultimately highlight both new and established breakage hotspots including actively transcribed regions, loop boundaries and early-replicating regions of the genome. Further, these breakomes challenge the paradigm that DNA breakage primarily occurs in hard-to-replicate regions. With these advancements, we begin to gain insights into the biological mechanisms both invoking and protecting against genome instability.
Affiliation(s)
- Xanita Saayman
  - Sir William Dunn School of Pathology, University of Oxford, UK
- Fumiko Esashi
  - Sir William Dunn School of Pathology, University of Oxford, UK
|
18
|
Altemose N, Logsdon GA, Bzikadze AV, Sidhwani P, Langley SA, Caldas GV, Hoyt SJ, Uralsky L, Ryabov FD, Shew CJ, Sauria MEG, Borchers M, Gershman A, Mikheenko A, Shepelev VA, Dvorkina T, Kunyavskaya O, Vollger MR, Rhie A, McCartney AM, Asri M, Lorig-Roach R, Shafin K, Aganezov S, Olson D, de Lima LG, Potapova T, Hartley GA, Haukness M, Kerpedjiev P, Gusev F, Tigyi K, Brooks S, Young A, Nurk S, Koren S, Salama SR, Paten B, Rogaev EI, Streets A, Karpen GH, Dernburg AF, Sullivan BA, Straight AF, Wheeler TJ, Gerton JL, Eichler EE, Phillippy AM, Timp W, Dennis MY, O'Neill RJ, Zook JM, Schatz MC, Pevzner PA, Diekhans M, Langley CH, Alexandrov IA, Miga KH. Complete genomic and epigenetic maps of human centromeres. Science 2022; 376:eabl4178. [PMID: 35357911 PMCID: PMC9233505 DOI: 10.1126/science.abl4178] [Citation(s) in RCA: 174] [Impact Index Per Article: 87.0] [Indexed: 12/20/2022]
Abstract
Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.
Affiliation(s)
- Nicolas Altemose
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Glennis A. Logsdon
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Andrey V. Bzikadze
  - Graduate Program in Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
- Pragya Sidhwani
  - Department of Biochemistry, Stanford University, Stanford, CA, USA
- Sasha A. Langley
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Gina V. Caldas
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Savannah J. Hoyt
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Lev Uralsky
  - Sirius University of Science and Technology, Sochi, Russia
  - Vavilov Institute of General Genetics, Moscow, Russia
- Colin J. Shew
  - Genome Center, MIND Institute, and Department of Biochemistry and Molecular Medicine, School of Medicine, University of California, Davis, Davis, CA, USA
- Ariel Gershman
  - Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
- Alla Mikheenko
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Tatiana Dvorkina
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Olga Kunyavskaya
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Mitchell R. Vollger
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Ann M. McCartney
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Ryan Lorig-Roach
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Kishwar Shafin
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Sergey Aganezov
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Daniel Olson
  - Department of Computer Science, University of Montana, Missoula, MT, USA
- Tamara Potapova
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Gabrielle A. Hartley
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Marina Haukness
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Fedor Gusev
  - Vavilov Institute of General Genetics, Moscow, Russia
- Kristof Tigyi
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Shelise Brooks
  - NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Alice Young
  - NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sergey Nurk
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Sofie R. Salama
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, CA, USA
- Evgeny I. Rogaev
  - Sirius University of Science and Technology, Sochi, Russia
  - Vavilov Institute of General Genetics, Moscow, Russia
  - Department of Psychiatry, University of Massachusetts Medical School, Worcester, MA, USA
  - Faculty of Biology, Lomonosov Moscow State University, Moscow, Russia
- Aaron Streets
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Gary H. Karpen
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - BioEngineering and BioMedical Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Abby F. Dernburg
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
  - Institute for Quantitative Biosciences (QB3), University of California, Berkeley, Berkeley, CA, USA
- Beth A. Sullivan
  - Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, NC, USA
- Travis J. Wheeler
  - Department of Computer Science, University of Montana, Missoula, MT, USA
- Jennifer L. Gerton
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - University of Kansas Medical School, Department of Biochemistry and Molecular Biology and Cancer Center, University of Kansas, Kansas City, KS, USA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Adam M. Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Winston Timp
  - Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Megan Y. Dennis
  - Genome Center, MIND Institute, and Department of Biochemistry and Molecular Medicine, School of Medicine, University of California, Davis, Davis, CA, USA
- Rachel J. O'Neill
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Justin M. Zook
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Michael C. Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Pavel A. Pevzner
  - Department of Computer Science and Engineering, University of California at San Diego, San Diego, CA, USA
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
- Charles H. Langley
  - Department of Evolution and Ecology, University of California Davis, Davis, CA, USA
- Ivan A. Alexandrov
  - Vavilov Institute of General Genetics, Moscow, Russia
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
  - Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- Karen H. Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, CA, USA
|
19
|
Antonarakis SE. Short arms of human acrocentric chromosomes and the completion of the human genome sequence. Genome Res 2022; 32:599-607. [PMID: 35361624 PMCID: PMC8997349 DOI: 10.1101/gr.275350.121] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 11/24/2022]
Abstract
The complete, ungapped sequence of the short arms of human acrocentric chromosomes (SAACs) is still unknown almost 20 years after the near completion of the Human Genome Project. Yet these short arms of Chromosomes 13, 14, 15, 21, and 22 contain the ribosomal DNA (rDNA) genes, which are of paramount importance for human biology. The sequences of SAACs show an extensive variation in the copy number of the various repetitive elements, the full extent of which is currently unknown. In addition, the full spectrum of repeated sequences, their organization, and the low copy number functional elements are also unknown. The Telomere-to-Telomere (T2T) Project using mainly long-read sequence technology has recently completed the assembly of the genome from a hydatidiform mole, CHM13, and has thus established a baseline reference for further studies on the organization, variation, functional annotation, and impact in human disorders of all the previously unknown genomic segments, including the SAACs. The publication of the initial results of the T2T Project will update and improve the reference genome for a better understanding of the evolution and function of the human genome.
Affiliation(s)
- Stylianos E Antonarakis
  - Department of Genetic Medicine and Development, University of Geneva Medical Faculty, 1211 Geneva, Switzerland
  - Foundation Campus Biotech, 1202 Geneva, Switzerland
  - Medigenome, Swiss Institute of Genomic Medicine, 1207 Geneva, Switzerland
|
20
|
Vourc’h C, Dufour S, Timcheva K, Seigneurin-Berny D, Verdel A. HSF1-Activated Non-Coding Stress Response: Satellite lncRNAs and Beyond, an Emerging Story with a Complex Scenario. Genes (Basel) 2022; 13:genes13040597. [PMID: 35456403 PMCID: PMC9032817 DOI: 10.3390/genes13040597] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/09/2022] [Revised: 03/18/2022] [Accepted: 03/19/2022] [Indexed: 12/21/2022] Open
Abstract
In eukaryotes, the heat shock response is orchestrated by a transcription factor named Heat Shock Factor 1 (HSF1). HSF1 is mostly characterized for its role in activating the expression of a repertoire of protein-coding genes, including the heat shock protein (HSP) genes. Remarkably, a growing set of reports indicates that, upon heat shock, HSF1 also targets various non-coding regions of the genome. Focusing primarily on mammals, this review aims to report the identity of the non-coding genomic sites directly bound by HSF1, and to describe the molecular function of the long non-coding RNAs (lncRNAs) produced in response to HSF1 binding. The described non-coding genomic targets of HSF1 are pericentric Satellite DNA repeats, (sub)telomeric DNA repeats, Short Interspersed Nuclear Element (SINE) repeats, transcriptionally active enhancers and the NEAT1 gene. This diverse set of non-coding genomic sites, which already appears to be an integral part of the cellular response to stress, may only represent the first of many. Thus, the study of the evolutionarily conserved heat stress response has the potential to emerge as a powerful cellular context to study lncRNAs, produced from repeated or unique DNA regions, with a regulatory function that is often well-documented but a mode of action that remains largely unknown.
Affiliation(s)
- Claire Vourc’h
  - Université de Grenoble Alpes (UGA), 38700 La Tronche, France
  - Correspondence: (C.V.); (A.V.)
- Solenne Dufour
  - Institute for Advanced Biosciences (IAB), Centre de Recherche UGA/Inserm U 1209/CNRS UMR 5309, Site Santé-Allée des Alpes, 38700 La Tronche, France
- Kalina Timcheva
  - Institute for Advanced Biosciences (IAB), Centre de Recherche UGA/Inserm U 1209/CNRS UMR 5309, Site Santé-Allée des Alpes, 38700 La Tronche, France
- Daphné Seigneurin-Berny
  - Institute for Advanced Biosciences (IAB), Centre de Recherche UGA/Inserm U 1209/CNRS UMR 5309, Site Santé-Allée des Alpes, 38700 La Tronche, France
- André Verdel
  - Institute for Advanced Biosciences (IAB), Centre de Recherche UGA/Inserm U 1209/CNRS UMR 5309, Site Santé-Allée des Alpes, 38700 La Tronche, France
  - Correspondence: (C.V.); (A.V.)
|
21
|
Merkle FT, Ghosh S, Genovese G, Handsaker RE, Kashin S, Meyer D, Karczewski KJ, O'Dushlaine C, Pato C, Pato M, MacArthur DG, McCarroll SA, Eggan K. Whole-genome analysis of human embryonic stem cells enables rational line selection based on genetic variation. Cell Stem Cell 2022; 29:472-486.e7. [PMID: 35176222 PMCID: PMC8900618 DOI: 10.1016/j.stem.2022.01.011] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 04/10/2020] [Revised: 10/29/2021] [Accepted: 01/24/2022] [Indexed: 02/02/2023]
Abstract
Despite their widespread use in research, there has not yet been a systematic genomic analysis of human embryonic stem cell (hESC) lines at a single-nucleotide resolution. We therefore performed whole-genome sequencing (WGS) of 143 hESC lines and annotated their single-nucleotide and structural genetic variants. We found that while a substantial fraction of hESC lines contained large deleterious structural variants, finer-scale structural and single-nucleotide variants (SNVs) that are ascertainable only through WGS analyses were present in hESC genomes and human blood-derived genomes at similar frequencies. Moreover, WGS allowed us to identify SNVs associated with cancer and other diseases that could alter cellular phenotypes and compromise the safety of hESC-derived cellular products transplanted into humans. As a resource to enable reproducible hESC research and safer translation, we provide a user-friendly WGS data portal and a data-driven scheme for cell line maintenance and selection.
Affiliation(s)
- Florian T Merkle
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, USA; Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA; Harvard Stem Cell Institute, Cambridge, MA 02138, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Wellcome - MRC Institute of Metabolic Science, University of Cambridge, Cambridge CB2 0QQ, UK; Wellcome - MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK.
- Sulagna Ghosh
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, USA; Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA; Harvard Stem Cell Institute, Cambridge, MA 02138, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Giulio Genovese
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Robert E Handsaker
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Seva Kashin
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Daniel Meyer
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Konrad J Karczewski
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Colm O'Dushlaine
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Carlos Pato
  - Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ 08901, USA; Department of Psychiatry, New Jersey Medical School, Rutgers University, Newark, NJ 07103, USA
- Michele Pato
  - Department of Psychiatry, Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ 08901, USA; Department of Psychiatry, New Jersey Medical School, Rutgers University, Newark, NJ 07103, USA
- Daniel G MacArthur
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, NSW, Australia; Centre for Population Genomics, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Steven A McCarroll
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
- Kevin Eggan
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA 02138, USA; Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA; Harvard Stem Cell Institute, Cambridge, MA 02138, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
|
22
|
Abstract
Melanoma is the most lethal skin cancer and originates from the malignant transformation of melanocytes. Although melanoma has long been regarded as a malignancy with few therapeutic options, increased biological understanding and unprecedented innovations in therapies targeting mutated driver genes and immune checkpoints have substantially improved the prognosis of patients. However, the low response rate and the inevitable occurrence of resistance to currently available targeted therapies remain obstacles to further improvement in melanoma management. Therefore, it is necessary to understand the mechanisms underlying melanoma pathogenesis more comprehensively, which might lead to more substantial progress in therapeutic approaches and expand clinical options for melanoma therapy. In this review, we first briefly introduce melanoma epidemiology, clinical subtypes, risk factors, and current therapies. Then, the signal pathways orchestrating melanoma pathogenesis, including genetic mutations, key transcriptional regulators, epigenetic dysregulations, metabolic reprogramming, crucial metastasis-related signals, tumor-promoting inflammatory pathways, and pro-angiogenic factors, are systematically reviewed and discussed. Subsequently, we outline current progress in therapies targeting mutated driver genes and immune checkpoints, as well as the mechanisms underlying treatment resistance. Finally, the prospects and challenges in the development of melanoma therapy, especially immunotherapy and related ongoing clinical trials, are summarized and discussed.
Affiliation(s)
- Weinan Guo
  - Department of Dermatology, Xijing Hospital, Fourth Military Medical University, No. 127 of West Changle Road, 710032, Xi'an, Shaanxi, China
- Huina Wang
  - Department of Dermatology, Xijing Hospital, Fourth Military Medical University, No. 127 of West Changle Road, 710032, Xi'an, Shaanxi, China
- Chunying Li
  - Department of Dermatology, Xijing Hospital, Fourth Military Medical University, No. 127 of West Changle Road, 710032, Xi'an, Shaanxi, China.
|
23
|
Abstract
We are entering a new era in genomics where entire centromeric regions are accurately represented in human reference assemblies. Access to these high-resolution maps will enable new surveys of sequence and epigenetic variation in the population and offer new insight into satellite array genomics and centromere function. Here, we focus on the sequence organization and evolution of alpha satellites, which are credited as the genetic and genomic definition of human centromeres due to their interaction with inner kinetochore proteins and their importance in the development of human artificial chromosome assays. We provide an overview of alpha satellite repeat structure and array organization in the context of these high-quality reference data sets; discuss the emergence of variation-based surveys; and provide perspective on the role of this new source of genetic and epigenetic variation in the context of chromosome biology, genome instability, and human disease.
Affiliation(s)
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95064, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
- Ivan A Alexandrov
  - Department of Genomics and Human Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow 119991, Russia
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg 199004, Russia
  - Research Center of Biotechnology of the Russian Academy of Sciences, Moscow 119071, Russia
|
24
|
Miga KH, Sullivan BA. Expanding studies of chromosome structure and function in the era of T2T genomics. Hum Mol Genet 2021; 30:R198-R205. [PMID: 34302168 PMCID: PMC8631062 DOI: 10.1093/hmg/ddab214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2021] [Revised: 07/16/2021] [Accepted: 07/20/2021] [Indexed: 11/13/2022] Open
Abstract
The recent accomplishment of a truly complete human genome has afforded a new view of chromosome structure and function that was limited 30 years ago. Here, we discuss the expansion of knowledge from the early cytological studies of the genome to the current high-resolution genomic, epigenetic and functional maps that have been achieved by recent technology and computational advances. These studies have revealed unexpected complexities of genome organization and function and uncovered new views of fundamental chromosomal elements. Comprehensive genomic maps will enable accurate diagnosis of human diseases caused by altered chromosome structure and function, facilitate development of chromosome-based therapies and shape the future of preventative medicine and healthcare.
Affiliation(s)
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- Beth A Sullivan
  - Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, NC, USA
|
25
|
Hausmann F, Kurtz S. DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention. Algorithms Mol Biol 2021; 16:20. [PMID: 34425870 PMCID: PMC8381506 DOI: 10.1186/s13015-021-00199-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2021] [Accepted: 08/03/2021] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Repetitive elements contribute a large part of eukaryotic genomes. For example, about 40 to 50% of human, mouse and rat genomes are repetitive. So identifying and classifying repeats is an important step in genome annotation. This annotation step is traditionally performed using alignment based methods, either in a de novo approach or by aligning the genome sequence to a species specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408-4410, 2019) developed a novel software tool dna-brnn to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements. RESULTS We have developed the methods of dna-brnn further and engineered a new software tool DeepGRP. This combines the basic concepts of Li (Bioinformatics 35:4408-4410, 2019) with current techniques developed for neural machine translation, the attention mechanism, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP, when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, using RepeatMasker-based training data to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats annotated in the Dfam database, but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database. CONCLUSIONS By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to dna-brnn. 
Improved running times are obtained by employing TensorFlow as implementation framework and the use of GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
Affiliation(s)
- Fabian Hausmann
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Falkenried 94, 20251 Hamburg, Germany
- Stefan Kurtz
  - ZBH - Center for Bioinformatics, MIN-Fakultät, Universität Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany
|
26
|
Using de novo assembly to identify structural variation of eight complex immune system gene regions. PLoS Comput Biol 2021; 17:e1009254. [PMID: 34343164 PMCID: PMC8363018 DOI: 10.1371/journal.pcbi.1009254] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/04/2021] [Revised: 08/13/2021] [Accepted: 07/06/2021] [Indexed: 12/11/2022] Open
Abstract
Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.
The human immune system is incredibly versatile, which underlies its capacity to defend the body against thousands of pathogens. At a molecular level, it recognizes pathogens using large libraries of antibodies and related protein receptors. These molecules are encoded by gene families that are particularly difficult to analyze due to their unusually complex patterns of similarities and differences between genes and individuals. To overcome this, we applied several sequencing methods to DNA from a single individual and developed methods to reconstruct the underlying sequence at eight of the immune-associated regions. Importantly, we used DNA extracted from monocytes to avoid capturing the further rearrangements that occur in active immune cells. We generated accurate assemblies by integrating multiple complementary data types, although we noted a small subset of locations that remain challenging. Moreover, we found that this individual contains multiple structural differences between the two inherited chromosomes and compared to previously analyzed genomes, affecting the copy number of immune system genes. Application of these methods in larger numbers of individuals will clearly uncover much more variation than is currently known, and might lead to new understanding of the effect of genetic variation on the broad range of human diseases determined by the immune response.
|
27
|
Lopes M, Louzada S, Gama-Carvalho M, Chaves R. Genomic Tackling of Human Satellite DNA: Breaking Barriers through Time. Int J Mol Sci 2021; 22:4707. [PMID: 33946766 PMCID: PMC8125562 DOI: 10.3390/ijms22094707] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 04/24/2021] [Accepted: 04/27/2021] [Indexed: 12/12/2022] Open
Abstract
(Peri)centromeric repetitive sequences and, more specifically, satellite DNA (satDNA) sequences, constitute a major human genomic component. SatDNA sequences can vary on a large number of features, including nucleotide composition, complexity, and abundance. Several satDNA families have been identified and characterized in the human genome through time, albeit at different speeds. Human satDNA families present a high degree of sub-variability, leading to the definition of various subfamilies with different organization and clustered localization. Evolution of satDNA analysis has enabled the progressive characterization of satDNA features. Despite recent advances in the sequencing of centromeric arrays, comprehensive genomic studies to assess their variability are still required to provide accurate and proportional representation of satDNA (peri)centromeric/acrocentric short arm sequences. Approaches combining multiple techniques have been successfully applied and seem to be the path to follow for generating integrated knowledge in the promising field of human satDNA biology.
Affiliation(s)
- Mariana Lopes
  - Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
  - Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
- Sandra Louzada
  - Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
  - Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
- Margarida Gama-Carvalho
  - Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
- Raquel Chaves
  - Laboratory of Cytogenomics and Animal Genomics (CAG), Department of Genetics and Biotechnology (DGB), University of Trás-os-Montes and Alto Douro (UTAD), 5000-801 Vila Real, Portugal; (M.L.); (S.L.)
  - Biosystems and Integrative Sciences Institute (BioISI), Faculty of Sciences, University of Lisbon, 1749-016 Lisbon, Portugal
|
28
|
Landers CC, Rabeler CA, Ferrari EK, D'Alessandro LR, Kang DD, Malisa J, Bashir SM, Carone DM. Ectopic expression of pericentric HSATII RNA results in nuclear RNA accumulation, MeCP2 recruitment, and cell division defects. Chromosoma 2021; 130:75-90. [PMID: 33585981 PMCID: PMC7889552 DOI: 10.1007/s00412-021-00753-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/11/2020] [Revised: 01/16/2021] [Accepted: 01/19/2021] [Indexed: 12/21/2022]
Abstract
Within the pericentric regions of human chromosomes reside large arrays of tandemly repeated satellite sequences. Expression of the human pericentric satellite HSATII is prevented by extensive heterochromatin silencing in normal cells, yet in many cancer cells, HSATII RNA is aberrantly expressed and accumulates in large nuclear foci in cis. Expression and aggregation of HSATII RNA in cancer cells is concomitant with recruitment of key chromatin regulatory proteins including methyl-CpG binding protein 2 (MeCP2). While HSATII expression has been observed in a wide variety of cancer cell lines and tissues, the effect of its expression is unknown. We tested the effect of stable expression of HSATII RNA within cells that do not normally express HSATII. Ectopic HSATII expression in HeLa and primary fibroblast cells leads to focal accumulation of HSATII RNA in cis and triggers the accumulation of MeCP2 onto nuclear HSATII RNA bodies. Further, long-term expression of HSATII RNA leads to cell division defects including lagging chromosomes, chromatin bridges, and other chromatin defects. Thus, expression of HSATII RNA in normal cells phenocopies its nuclear accumulation in cancer cells and allows for the characterization of the cellular events triggered by aberrant expression of pericentric satellite RNA.
Affiliation(s)
- Catherine C Landers
  - Department of Nutritional Sciences, University of Connecticut, Storrs, CT, USA
- Diana D Kang
  - Division of Pharmaceutics and Pharmacology College of Pharmacy, Ohio State University, Columbus, OH, USA
- Jessica Malisa
  - Stanford University School of Medicine, Stanford, CA, USA
- Safia M Bashir
  - Department of Biology, Swarthmore College, Swarthmore, PA, USA
- Dawn M Carone
  - Department of Biology, Swarthmore College, Swarthmore, PA, USA.
|
29
|
Burley JT, Kellner JR, Hubbell SP, Faircloth BC. Genome assemblies for two Neotropical trees: Jacaranda copaia and Handroanthus guayacan. G3 (Bethesda, Md.) 2021; 11:jkab010. [PMID: 33693604 PMCID: PMC8034707 DOI: 10.1093/g3journal/jkab010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2020] [Accepted: 12/22/2020] [Indexed: 12/01/2022]
Abstract
The lack of genomic resources for tropical canopy trees is impeding several research avenues in tropical forest biology. We present genome assemblies for two Neotropical hardwood species, Jacaranda copaia and Handroanthus (formerly Tabebuia) guayacan, that are model systems for research on tropical tree demography and flowering phenology. For each species, we combined Illumina short-read data with in vitro proximity-ligation (Chicago) libraries to generate an assembly. For Jacaranda copaia, we obtained 104X physical coverage and produced an assembly with N50/N90 scaffold lengths of 1.020/0.277 Mbp. For H. guayacan, we obtained 129X coverage and produced an assembly with N50/N90 scaffold lengths of 0.795/0.165 Mbp. J. copaia and H. guayacan assemblies contained 95.8% and 87.9% of benchmarking orthologs, although they constituted only 77.1% and 66.7% of the estimated genome sizes of 799 and 512 Mbp, respectively. These differences were potentially due to high repetitive sequence content (>59.31% and 45.59%) and high heterozygosity (0.5% and 0.8%) in each species. Finally, we compared each new assembly to a previously sequenced genome for Handroanthus impetiginosus using whole-genome alignment. This analysis indicated extensive gene duplication in H. impetiginosus since its divergence from H. guayacan.
Affiliation(s)
- John T Burley
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912, USA
  - Institute at Brown for Environment and Society, Brown University, Providence, RI 02912, USA
- James R Kellner
  - Department of Ecology and Evolutionary Biology, Brown University, Providence, RI 02912, USA
  - Institute at Brown for Environment and Society, Brown University, Providence, RI 02912, USA
- Stephen P Hubbell
  - Department of Ecology and Evolutionary Biology, University of California—Los Angeles, Los Angeles, CA 90095, USA
- Brant C Faircloth
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA
|
30
|
Cechova M. Probably Correct: Rescuing Repeats with Short and Long Reads. Genes (Basel) 2020; 12:48. [PMID: 33396198 PMCID: PMC7823596 DOI: 10.3390/genes12010048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/09/2020] [Revised: 12/23/2020] [Accepted: 12/24/2020] [Indexed: 02/07/2023] Open
Abstract
Ever since the introduction of high-throughput sequencing following the human genome project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50-69%) is repetitive. As a result, a sizable proportion of sequencing reads is multi-mapping, i.e., without a unique placement in the genome. The two key parameters for whether or not a read is multi-mapping are the read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from "telomere to telomere". Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, disorienting the aligners and assemblers both in accuracy and speed. Here, I review the proposed and implemented solutions to the repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect the shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
Affiliation(s)
- Monika Cechova
  - Genetics and Reproductive Biotechnologies, Veterinary Research Institute, Central European Institute of Technology (CEITEC), 621 00 Brno, Czech Republic
|
31
|
Ahmad SF, Singchat W, Jehangir M, Suntronpong A, Panthum T, Malaivijitnond S, Srikulnath K. Dark Matter of Primate Genomes: Satellite DNA Repeats and Their Evolutionary Dynamics. Cells 2020; 9:E2714. [PMID: 33352976 PMCID: PMC7767330 DOI: 10.3390/cells9122714] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 10/27/2020] [Revised: 12/15/2020] [Accepted: 12/16/2020] [Indexed: 12/12/2022] Open
Abstract
A substantial portion of the primate genome is composed of non-coding regions, so-called "dark matter", which includes an abundance of tandemly repeated sequences called satellite DNA. Collectively known as the satellitome, this genomic component offers exciting evolutionary insights into aspects of primate genome biology that raise new questions and challenge existing paradigms. A complete human reference genome was recently reported with telomere-to-telomere human X chromosome assembly that resolved hundreds of dark regions, encompassing a 3.1 Mb centromeric satellite array that had not been identified previously. With the recent exponential increase in the availability of primate genomes, and the development of modern genomic and bioinformatics tools, extensive growth in our knowledge concerning the structure, function, and evolution of satellite elements is expected. The current state of knowledge on this topic is summarized, highlighting various types of primate-specific satellite repeats to compare their proportions across diverse lineages. Inter- and intraspecific variation of satellite repeats in the primate genome are reviewed. The functional significance of these sequences is discussed by describing how the transcriptional activity of satellite repeats can affect gene expression during different cellular processes. Sex-linked satellites are outlined, together with their respective genomic organization. Mechanisms are proposed whereby satellite repeats might have emerged as novel sequences during different evolutionary phases. Finally, the main challenges that hinder the detection of satellite DNA are outlined and an overview of the latest methodologies to address technological limitations is presented.
Affiliation(s)
- Syed Farhan Ahmad
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, Bangkok 10900, Thailand
- Worapong Singchat
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, Bangkok 10900, Thailand
- Maryam Jehangir
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Department of Structural and Functional Biology, Institute of Bioscience at Botucatu, São Paulo State University (UNESP), Botucatu, São Paulo 18618-689, Brazil
- Aorarat Suntronpong
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, Bangkok 10900, Thailand
- Thitipong Panthum
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, Bangkok 10900, Thailand
- Suchinda Malaivijitnond
  - National Primate Research Center of Thailand, Chulalongkorn University, Saraburi 18110, Thailand
  - Department of Biology, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
- Kornsorn Srikulnath
  - Laboratory of Animal Cytogenetics and Comparative Genomics (ACCG), Department of Genetics, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; (S.F.A.); (W.S.); (M.J.); (A.S.); (T.P.)
  - Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, Bangkok 10900, Thailand
  - National Primate Research Center of Thailand, Chulalongkorn University, Saraburi 18110, Thailand
  - Center of Excellence on Agricultural Biotechnology (AG-BIO/PERDO-CHE), Bangkok 10900, Thailand
  - Omics Center for Agriculture, Bioresources, Food and Health, Kasetsart University (OmiKU), Bangkok 10900, Thailand
|
32
|
de Lima LG, Hanlon SL, Gerton JL. Origins and Evolutionary Patterns of the 1.688 Satellite DNA Family in Drosophila Phylogeny. G3 (Bethesda, Md.) 2020; 10:4129-4146. [PMID: 32934018 PMCID: PMC7642928 DOI: 10.1534/g3.120.401727] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/06/2020] [Accepted: 09/09/2020] [Indexed: 12/11/2022]
Abstract
Satellite DNAs (satDNAs) are a ubiquitous feature of eukaryotic genomes and are usually the major components of constitutive heterochromatin. The 1.688 satDNA, also known as the 359 bp satellite, is one of the most abundant repetitive sequences in Drosophila melanogaster and has been linked to several different biological functions. We investigated the presence and evolution of the 1.688 satDNA in 16 Drosophila genomes. We find that the 1.688 satDNA family is much more ancient than previously appreciated, being shared among part of the melanogaster group that diverged from a common ancestor ∼27 Mya. We found that the 1.688 satDNA family has two major subfamilies spread throughout Drosophila phylogeny (∼360 bp and ∼190 bp). Phylogenetic analysis of ∼10,000 repeats extracted from 14 of the species revealed that the 1.688 satDNA family is present within heterochromatin and euchromatin. A high number of euchromatic repeats are gene proximal, suggesting the potential for local gene regulation. Notably, heterochromatic copies display concerted evolution and a species-specific pattern, whereas euchromatic repeats display a more typical evolutionary pattern, suggesting that chromatin domains may influence the evolution of these sequences. Overall, our data indicate the 1.688 satDNA as the most perduring satDNA family described in Drosophila phylogeny to date. Our study provides a strong foundation for future work on the functional roles of 1.688 satDNA across many Drosophila species.
Affiliation(s)
- Stacey L Hanlon
  - Stowers Institute for Medical Research, Kansas City, Missouri 64110
|
33
|
Shadle SC, Bennett SR, Wong CJ, Karreman NA, Campbell AE, van der Maarel SM, Bass BL, Tapscott SJ. DUX4-induced bidirectional HSATII satellite repeat transcripts form intranuclear double-stranded RNA foci in human cell models of FSHD. Hum Mol Genet 2020; 28:3997-4011. [PMID: 31630170 DOI: 10.1093/hmg/ddz242] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 08/12/2019] [Revised: 09/19/2019] [Accepted: 10/03/2019] [Indexed: 12/29/2022] Open
Abstract
The DUX4 transcription factor is normally expressed in the cleavage-stage embryo and regulates genes involved in embryonic genome activation. Misexpression of DUX4 in skeletal muscle, however, is toxic and causes facioscapulohumeral muscular dystrophy (FSHD). We recently showed DUX4-induced toxicity is due, in part, to the activation of the double-stranded RNA (dsRNA) response pathway and the accumulation of intranuclear dsRNA foci. Here, we determined the composition of DUX4-induced dsRNAs. We found that a subset of DUX4-induced dsRNAs originate from inverted Alu repeats embedded within the introns of DUX4-induced transcripts and from DUX4-induced dsRNA-forming intergenic transcripts enriched for endogenous retroviruses, Alu and LINE-1 elements. However, these repeat classes were also represented in dsRNAs from cells not expressing DUX4. In contrast, pericentric human satellite II (HSATII) repeats formed a class of dsRNA specific to the DUX4 expressing cells. Further investigation revealed that DUX4 can initiate the bidirectional transcription of normally heterochromatin-silenced HSATII repeats. DUX4-induced HSATII RNAs co-localized with DUX4-induced nuclear dsRNA foci and with intranuclear aggregation of EIF4A3 and ADAR1. Finally, gapmer-mediated knockdown of HSATII transcripts depleted DUX4-induced intranuclear ribonucleoprotein aggregates and decreased DUX4-induced cell death, suggesting that HSATII-formed dsRNAs contribute to DUX4 toxicity.
Affiliation(s)
- Sean C Shadle
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
  - Molecular and Cellular Biology Program, University of Washington, Seattle, WA 91805, USA
- Sean R Bennett
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Chao-Jen Wong
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Nancy A Karreman
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Amy E Campbell
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Brenda L Bass
  - Department of Biochemistry, University of Utah, Salt Lake City, UT 84112, USA
- Stephen J Tapscott
  - Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
|
34
|
Satellitome Analysis in the Ladybird Beetle Hippodamia variegata (Coleoptera, Coccinellidae). Genes (Basel) 2020; 11:genes11070783. [PMID: 32668664 PMCID: PMC7397073 DOI: 10.3390/genes11070783] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/18/2020] [Revised: 07/09/2020] [Accepted: 07/09/2020] [Indexed: 12/29/2022] Open
Abstract
Hippodamia variegata is one of the most commercialized ladybirds used for the biological control of aphid pest species in many economically important crops. This species is the first Coccinellidae whose satellitome has been studied by applying new sequencing technologies and bioinformatics tools. We found that 47% of the H. variegata genome is composed of repeated sequences. We identified 30 satellite DNA (satDNA) families with a median intragenomic divergence of 5.75% and A+T content between 45.6% and 74.7%. This species shows satDNA families with highly variable sizes although the most common size is 100–200 bp. However, we highlight the existence of a satDNA family with a repeat unit of 2 kb, the largest repeat unit described in Coleoptera. PCR amplifications for fluorescence in situ hybridization (FISH) probe generation were performed for the four most abundant satDNA families. FISH with the most abundant satDNA family as a probe shows its pericentromeric location on all chromosomes. This location is coincident with the heterochromatin revealed by C-banding and DAPI staining, also analyzed in this work. Hybridization signals for other satDNA families were located only on certain bivalents and the X chromosome. These satDNAs could be very useful as chromosomal markers due to their reduced location.
|
35
|
Miga KH. Centromere studies in the era of 'telomere-to-telomere' genomics. Exp Cell Res 2020; 394:112127. [PMID: 32504677 DOI: 10.1016/j.yexcr.2020.112127] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 01/08/2020] [Revised: 05/23/2020] [Accepted: 05/30/2020] [Indexed: 12/17/2022]
Abstract
We are entering into an exciting era of genomics where truly complete, high-quality assemblies of human chromosomes are available end-to-end, or from 'telomere-to-telomere' (T2T). This technological advance offers a new opportunity to include endogenous human centromeric regions in high-resolution, sequence-based studies. These emerging reference maps are expected to reveal a new functional landscape in the human genome, where centromere proteins, transcriptional regulation, and spatial organization can be examined with base-level resolution across different stages of development and disease. Such studies will depend on innovative assembly methods of extremely long tandem repeats (ETRs), or satellite DNAs, paired with the development of new, orthogonal validation methods to ensure accuracy and completeness. This review reflects the progress in centromere genomics, credited by recent advancements in long-read sequencing and assembly methods. In doing so, I will discuss the challenges that remain and the promise for a new period of scientific discovery for satellite DNA biology and centromere function.
Affiliation(s)
- Karen H Miga
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, 95064, USA.
|
36
|
Akkipeddi SMK, Velleca AJ, Carone DM. Probing the function of long noncoding RNAs in the nucleus. Chromosome Res 2020; 28:87-110. [PMID: 32026224 PMCID: PMC7131881 DOI: 10.1007/s10577-019-09625-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/01/2019] [Revised: 12/20/2019] [Accepted: 12/29/2019] [Indexed: 12/26/2022]
Abstract
The nucleus is a highly organized and dynamic environment where regulation and coordination of processes such as gene expression and DNA replication are paramount. In recent years, noncoding RNAs have emerged as key participants in the regulation of nuclear processes. There are a multitude of functional roles for long noncoding RNA (lncRNA), mediated through their ability to act as molecular scaffolds bridging interactions with proteins, chromatin, and other RNA molecules within the nuclear environment. In this review, we discuss the diversity of techniques that have been developed to probe the function of nuclear lncRNAs, along with the ways in which those techniques have revealed insights into their mechanisms of action. Foundational observations into lncRNA function have been gleaned from molecular cytology-based, single-cell approaches to illuminate both the localization and abundance of lncRNAs in addition to their potential binding partners. Biochemical, extraction-based approaches have revealed the molecular contacts between lncRNAs and other molecules within the nuclear environment and how those interactions may contribute to nuclear organization and regulation. Using examples of well-studied nuclear lncRNAs, we demonstrate that the emerging functions of individual lncRNAs have been most clearly deduced from combined cytology and biochemical approaches tailored to study specific lncRNAs. As more functional nuclear lncRNAs continue to emerge, the development of additional technologies to study their interactions and mechanisms of action promises to continually expand our understanding of nuclear organization, chromosome architecture, genome regulation, and disease states.
Affiliation(s)
- Anthony J Velleca
  - Department of Molecular Pharmacology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Dawn M Carone
  - Department of Biology, Swarthmore College, Swarthmore, PA, USA.
|
37
|
O'Neill RJ. Seq'ing identity and function in a repeat-derived noncoding RNA world. Chromosome Res 2020; 28:111-127. [PMID: 32146545 PMCID: PMC7393779 DOI: 10.1007/s10577-020-09628-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/11/2019] [Revised: 02/07/2020] [Accepted: 02/14/2020] [Indexed: 01/06/2023]
Abstract
Innovations in high-throughput sequencing approaches are being marshaled to both reveal the composition of the abundant and heterogeneous noncoding RNAs that populate cell nuclei and lend insight to the mechanisms by which noncoding RNAs influence chromosome biology and gene expression. This review focuses on some of the recent technological developments that have enabled the isolation of nascent transcripts and chromatin-associated and DNA-interacting RNAs. Coupled with emerging genome assembly and analytical approaches, the field is poised to achieve a comprehensive catalog of nuclear noncoding RNAs, including those derived from repetitive regions within eukaryotic genomes. Herein, particular attention is paid to the challenges and advances in the sequence analyses of repeat and transposable element-derived noncoding RNAs and in ascribing specific function(s) to such RNAs.
Affiliation(s)
- Rachel J O'Neill
  - Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA.
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, 06269, USA.
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, 06030, USA.
|
38
|
Puppo IL, Saifitdinova AF, Tonyan ZN. The Role of Satellite DNA in Causing Structural Rearrangements in Human Karyotype. Russ J Genet+ 2020. [DOI: 10.1134/s1022795419080155] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
|
39
|
Abstract
ChIP-Seq blacklists contain genomic regions that frequently produce artifacts and noise in ChIP-Seq experiments. To improve signal-to-noise ratio, ChIP-Seq pipelines often remove data points that map to blacklist regions. Existing blacklists have been compiled in a manual or semiautomated way. In this article we describe PeakPass, an efficient method to generate blacklists, and demonstrate that blacklists can increase ChIP-Seq data quality. PeakPass leverages machine learning and attempts to automate blacklist generation. PeakPass uses a random forest classifier in combination with genomic features such as sequence, annotated repeats, complexity, assembly gaps, and the ratio of multimapping to uniquely mapping reads to identify artifact regions. We have validated PeakPass on a large data set and tested it for the purpose of upgrading a blacklist to a new reference genome version. We trained PeakPass on the ENCODE blacklist for the hg19 human reference genome, and created an updated blacklist for hg38. To assess the performance of this blacklist, we tested 42 ChIP-Seq replicates from 24 experiments using 10 ChIP-Seq quality metrics including relative strand coefficient, standardized standard deviation, and enrichment of reads in promoter regions. Using the blacklist generated by PeakPass resulted in a statistically significant improvement for nine of these metrics.
Affiliation(s)
- Charles E Wimberley
  - Department of Computer Science, NC State University, Raleigh, North Carolina
- Steffen Heber
  - Department of Computer Science, NC State University, Raleigh, North Carolina
|
40
|
|
41
|
Abstract
The cellular response to heat shock requires massive adaptation of gene expression driven by the transcription factor HSF1, which assembles in nuclear stress bodies together with human satellite III RNA and numerous splicing factors. In this issue of The EMBO Journal, Ninomiya et al demonstrate that nuclear stress bodies serve as a platform for phosphorylation of the SR protein SRSF9 by the CLK1 kinase, which promotes retention of a large number of introns during the recovery phase from heat shock.
Affiliation(s)
- Sylvia Erhardt
- Center for Molecular Biology of Heidelberg University (ZMBH), DKFZ-ZMBH Alliance, Heidelberg, Germany
- Georg Stoecklin
- Center for Molecular Biology of Heidelberg University (ZMBH), DKFZ-ZMBH Alliance, Heidelberg, Germany; Mannheim Institute for Innate Immunoscience (MI3), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
|
42
|
Achrem M, Szućko I, Kalinka A. The epigenetic regulation of centromeres and telomeres in plants and animals. COMPARATIVE CYTOGENETICS 2020; 14:265-311. [PMID: 32733650 PMCID: PMC7360632 DOI: 10.3897/compcytogen.v14i2.51895] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 03/09/2020] [Accepted: 05/18/2020] [Indexed: 05/10/2023]
Abstract
The centromere is a chromosomal region where the kinetochore is formed, which is the attachment point of spindle fibers. Thus, it is responsible for the correct chromosome segregation during cell division. Telomeres protect chromosome ends against enzymatic degradation and fusions, and localize chromosomes in the cell nucleus. For this reason, centromeres and telomeres are parts of each linear chromosome that are necessary for their proper functioning. More and more research results show that the identity and functions of these chromosomal regions are epigenetically determined. Telomeres and centromeres are both usually described as highly condensed heterochromatin regions. However, the epigenetic nature of centromeres and telomeres is unique, as epigenetic modifications characteristic of both eu- and heterochromatin have been found in these areas. This specificity allows for the proper functioning of both regions, thereby affecting chromosome homeostasis. This review focuses on demonstrating the role of epigenetic mechanisms in the functioning of centromeres and telomeres in plants and animals.
Affiliation(s)
- Magdalena Achrem
- Institute of Biology, University of Szczecin, Szczecin, Poland
- Molecular Biology and Biotechnology Center, University of Szczecin, Szczecin, Poland
- Izabela Szućko
- Institute of Biology, University of Szczecin, Szczecin, Poland
- Molecular Biology and Biotechnology Center, University of Szczecin, Szczecin, Poland
- Anna Kalinka
- Institute of Biology, University of Szczecin, Szczecin, Poland
- Molecular Biology and Biotechnology Center, University of Szczecin, Szczecin, Poland
|
43
|
Lio Chan-Wang J, Yue Xiaojing, López-Moyado Isaac F, Tahiliani Mamta, Aravind L, Rao Anjana. TET methylcytosine oxidases: new insights from a decade of research. J Biosci 2020; 45:21. [PMID: 31965999 PMCID: PMC7216820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
In mammals, DNA methyltransferases transfer a methyl group from S-adenosylmethionine to the 5 position of cytosine in DNA. The product of this reaction, 5-methylcytosine (5mC), has many roles, particularly in suppressing transposable and repeat elements in DNA. Moreover, in many cellular systems, cell lineage specification is accompanied by DNA demethylation at the promoters of genes expressed at high levels in the differentiated cells. However, since direct cleavage of the C-C bond connecting the methyl group to the 5 position of cytosine is thermodynamically disfavoured, the question of whether DNA methylation was reversible remained unclear for many decades. This puzzle was solved by our discovery of the TET (Ten-Eleven Translocation) family of 5-methylcytosine oxidases, which use reduced iron, molecular oxygen and the tricarboxylic acid cycle metabolite 2-oxoglutarate (also known as α-ketoglutarate) to oxidise the methyl group of 5mC to 5-hydroxymethylcytosine (5hmC) and beyond. TET-generated oxidised methylcytosines are intermediates in at least two pathways of DNA demethylation, which differ in their dependence on DNA replication. In the decade since their discovery, TET enzymes have been shown to have important roles in embryonic development, cell lineage specification, neuronal function and cancer. We review these findings and discuss their implications here.
Affiliation(s)
- Chan-Wang J. Lio
- Division of Signaling and Gene Expression, La Jolla Institute for Immunology, La Jolla, CA 92037, USA
- Xiaojing Yue
- Division of Signaling and Gene Expression, La Jolla Institute for Immunology, La Jolla, CA 92037, USA
- Isaac F. López-Moyado
- Division of Signaling and Gene Expression, La Jolla Institute for Immunology, La Jolla, CA 92037, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA 92093, USA
- Sanford Consortium for Regenerative Medicine, La Jolla, CA 92093, USA
- Mamta Tahiliani
- Skirball Institute of Biomolecular Medicine, New York University School of Medicine, New York, NY 10012, USA
- Department of Biology, New York University, New York, NY 10003, USA
- L. Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA
- Anjana Rao
- Division of Signaling and Gene Expression, La Jolla Institute for Immunology, La Jolla, CA 92037, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA 92093, USA
- Sanford Consortium for Regenerative Medicine, La Jolla, CA 92093, USA
- Department of Pharmacology, University of California San Diego, La Jolla, CA 92093, USA
- Moores Cancer Center, University of California San Diego, La Jolla, CA 92093, USA
|
44
|
Abstract
Marsupial genomes, which are packaged into large chromosomes, provide a powerful resource for studying the mechanisms of genome evolution. The extensive and valuable body of work on marsupial cytogenetics, combined more recently with genome sequence data, has enabled prediction of the 2n = 14 karyotype ancestral to all marsupial families. The application of both chromosome biology and genome sequencing, or chromosomics, has been a necessary approach for various aspects of mammalian genome evolution, such as understanding sex chromosome evolution and the origin and evolution of transmissible tumors in Tasmanian devils. The next phase of marsupial genome evolution research will employ chromosomics approaches to begin addressing fundamental questions in marsupial genome evolution and chromosome evolution more generally. The answers to these complex questions will impact our understanding across a broad range of fields, including the genetics of speciation, genome adaptation to environmental stressors, and species management.
Affiliation(s)
- Janine E Deakin
- Institute for Applied Ecology, University of Canberra, Canberra, Australian Capital Territory 2617, Australia
- Rachel J O'Neill
- Department of Molecular and Cell Biology and Institute for Systems Genomics, University of Connecticut, Storrs, Connecticut 06269, USA
|
45
|
Harris RS, Cechova M, Makova KD. Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data. Bioinformatics 2019; 35:4809-4811. [PMID: 31290946 PMCID: PMC6853708 DOI: 10.1093/bioinformatics/btz484] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 11/27/2018] [Revised: 04/24/2019] [Accepted: 07/09/2019] [Indexed: 12/31/2022] Open
Abstract
SUMMARY: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.
AVAILABILITY AND IMPLEMENTATION: NCRF is implemented in C, supported by several python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
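Illustrative note (not part of the cited work): to make the task concrete, the sketch below locates runs of a specified motif, such as the (AATGG)n repeat named in the abstract, using exact matching only. It is not NCRF's method and tolerates no sequencing errors, which is the very limitation NCRF is designed to address; the function name, the `min_copies` threshold, and the toy read are assumptions made for this example.

```python
import re
from typing import List, Tuple


def find_exact_tandem_runs(seq: str, motif: str, min_copies: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, copies) for every run of at least min_copies exact motif copies."""
    # Builds a pattern such as (?:AATGG){3,} for the given motif and threshold.
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_copies},}}")
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(motif))
        for m in pattern.finditer(seq)
    ]


# Toy read: a 4-copy run, a spacer, then a 2-copy run below the threshold.
read = "ACGT" + "AATGG" * 4 + "TTT" + "AATGG" * 2
runs = find_exact_tandem_runs(read, "AATGG")
```

A single substitution or indel inside an array splits an exact-match run in two, so on noisy long reads an alignment-based tool is needed instead.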
Affiliation(s)
- Robert S Harris
- Department of Biology, The Pennsylvania State University, State College, PA 16802, USA
- Monika Cechova
- Department of Biology, The Pennsylvania State University, State College, PA 16802, USA
- Kateryna D Makova
- Department of Biology, The Pennsylvania State University, State College, PA 16802, USA
- Center for Medical Genomics, The Pennsylvania State University, State College, PA 16802, USA
|
46
|
Cechova M, Harris RS, Tomaszkiewicz M, Arbeithuber B, Chiaromonte F, Makova KD. High satellite repeat turnover in great apes studied with short- and long-read technologies. Mol Biol Evol 2019; 36:2415-2431. [PMID: 31273383 PMCID: PMC6805231 DOI: 10.1093/molbev/msz156] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 03/12/2019] [Revised: 06/12/2019] [Accepted: 06/13/2019] [Indexed: 12/23/2022] Open
Abstract
Satellite repeats are a structural component of centromeres and telomeres, and in some instances, their divergence is known to drive speciation. Due to their highly repetitive nature, satellite sequences have been understudied and underrepresented in genome assemblies. To investigate their turnover in great apes, we studied satellite repeats of unit sizes up to 50 bp in human, chimpanzee, bonobo, gorilla, and Sumatran and Bornean orangutans, using unassembled short and long sequencing reads. The density of satellite repeats, as identified from accurate short reads (Illumina), varied greatly among great ape genomes. These were dominated by a handful of abundant repeated motifs, frequently shared among species, which formed two groups: 1) the (AATGG)n repeat (critical for heat shock response) and its derivatives; and 2) subtelomeric 32-mers involved in telomeric metabolism. Using the densities of abundant repeats, individuals could be classified into species. However, clustering did not reproduce the accepted species phylogeny, suggesting rapid repeat evolution. Several abundant repeats were enriched in males versus females; using Y chromosome assemblies or Fluorescent In Situ Hybridization, we validated their location on the Y. Finally, applying a novel computational tool, we identified many satellite repeats completely embedded within long Oxford Nanopore and Pacific Biosciences reads. Such repeats were up to 59 kb in length and consisted of perfect repeats interspersed with other similar sequences. Our results based on sequencing reads generated with three different technologies provide the first detailed characterization of great ape satellite repeats, and open new avenues for exploring their functions.
Affiliation(s)
- Monika Cechova
- Department of Biology, Pennsylvania State University, University Park, PA USA
- Robert S Harris
- Department of Biology, Pennsylvania State University, University Park, PA USA
- Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA USA
- Barbara Arbeithuber
- Department of Biology, Pennsylvania State University, University Park, PA USA
- Francesca Chiaromonte
- Department of Statistics, Pennsylvania State University, University Park, PA USA; EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy; Center for Medical Genomics, Penn State, University Park, PA USA
|
47
|
Ebler J, Haukness M, Pesout T, Marschall T, Paten B. Haplotype-aware diplotyping from noisy long reads. Genome Biol 2019; 20:116. [PMID: 31159868 PMCID: PMC6547545 DOI: 10.1186/s13059-019-1709-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 07/25/2018] [Accepted: 05/06/2019] [Indexed: 12/19/2022] Open
Abstract
Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.
Affiliation(s)
- Jana Ebler
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany
- Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
- Trevor Pesout
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
- Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA
|
48
|
Duda Z, Trusiak S, O'Neill R. Centromere Transcription: Means and Motive. PROGRESS IN MOLECULAR AND SUBCELLULAR BIOLOGY 2019; 56:257-281. [PMID: 28840241 DOI: 10.1007/978-3-319-58592-5_11] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 02/05/2023]
Abstract
The chromosome biology field at large has benefited from studies of the cell cycle components, protein cascades and genomic landscape that are required for centromere identity, assembly and stable transgenerational inheritance. Research over the past 20 years has challenged the classical descriptions of a centromere as a stable, unmutable, and transcriptionally silent chromosome component. Instead, based on studies from a broad range of eukaryotic species, including yeast, fungi, plants, and animals, the centromere has been redefined as one of the more dynamic areas of the eukaryotic genome, requiring coordination of protein complex assembly, chromatin assembly, and transcriptional activity in a cell cycle specific manner. What has emerged from more recent studies is the realization that the transcription of specific types of nucleic acids is a key process in defining centromere integrity and function. To illustrate the transcriptional landscape of centromeres across eukaryotes, we focus this review on how transcripts interact with centromere proteins, when in the cell cycle centromeric transcription occurs, and what types of sequences are being transcribed. Utilizing data from broadly different organisms, a picture emerges that places centromeric transcription as an integral component of centromere function.
Affiliation(s)
- Zachary Duda
- Department of Molecular and Cell Biology, The Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA
- Sarah Trusiak
- Department of Molecular and Cell Biology, The Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA
- Rachel O'Neill
- Department of Molecular and Cell Biology, The Institute for Systems Genomics, University of Connecticut, Storrs, CT, 06269, USA
|
49
|
Miga KH. Centromeric Satellite DNAs: Hidden Sequence Variation in the Human Population. Genes (Basel) 2019; 10:E352. [PMID: 31072070 PMCID: PMC6562703 DOI: 10.3390/genes10050352] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 04/02/2019] [Revised: 05/03/2019] [Accepted: 05/03/2019] [Indexed: 12/30/2022] Open
Abstract
The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease support that these regions present a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.
Affiliation(s)
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
|
50
|
Breitwieser FP, Pertea M, Zimin AV, Salzberg SL. Human contamination in bacterial genomes has created thousands of spurious proteins. Genome Res 2019; 29:954-960. [PMID: 31064768 PMCID: PMC6581058 DOI: 10.1101/gr.245373.118] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 10/24/2018] [Accepted: 05/03/2019] [Indexed: 01/22/2023]
Abstract
Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
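Illustrative note (not part of the cited work): the mitigation suggested at the end of the abstract, filtering out small contigs from draft assemblies, can be sketched as a length filter over FASTA records. This is a minimal sketch; the helper names are hypothetical, and the cutoff used below is a toy value for the example, since the abstract does not state a threshold.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_small_contigs(records: Iterable[Tuple[str, str]], min_length: int) -> List[Tuple[str, str]]:
    """Keep contigs whose sequence length is at least min_length."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy assembly; the cutoff of 10 bases is an assumption for illustration only.
fasta = [">contig_1", "ACGTACGTAC", "GTACGT", ">contig_2", "ACGT", ">contig_3", "ACGTACGTACGT"]
kept = drop_small_contigs(read_fasta(fasta), min_length=10)
```

In practice the cutoff would have to be chosen per assembly, weighing the contaminant contigs removed against the genuine sequence lost.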
Affiliation(s)
- Florian P Breitwieser
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
- Mihaela Pertea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA; Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Aleksey V Zimin
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Steven L Salzberg
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA; Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA; Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland 21205, USA
|