1. Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Richard Durbin
  - Department of Genetics, Cambridge University, Cambridge, UK.
2. Zhang S, Xu N, Fu L, Yang X, Li Y, Yang Z, Feng Y, Ma K, Jiang X, Han J, Hu R, Zhang L, de Gennaro L, Ryabov F, Meng D, He Y, Wu D, Yang C, Paparella A, Mao Y, Bian X, Lu Y, Antonacci F, Ventura M, Shepelev VA, Miga KH, Alexandrov IA, Logsdon GA, Phillippy AM, Su B, Zhang G, Eichler EE, Lu Q, Shi Y, Sun Q, Mao Y. Comparative genomics of macaques and integrated insights into genetic variation and population history. bioRxiv 2024:2024.04.07.588379. [PMID: 38645259 PMCID: PMC11030432 DOI: 10.1101/2024.04.07.588379] [Indexed: 04/23/2024]
Abstract
The crab-eating macaques (Macaca fascicularis) and rhesus macaques (M. mulatta) are widely studied nonhuman primates in biomedical and evolutionary research. Despite their significance, the current understanding of the complex genomic structure in macaques and the differences between species requires substantial improvement. Here, we present a complete genome assembly of a crab-eating macaque and 20 haplotype-resolved macaque assemblies to investigate the complex regions and major genomic differences between species. Segmental duplication in macaques is ∼42% lower, while centromeres are ∼3.7 times longer than those in humans. The characterization of ∼2 Mbp fixed genetic variants and ∼240 Mbp complex loci highlights potential associations with metabolic differences between the two macaque species (e.g., CYP2C76 and EHBP1L1). Additionally, hundreds of alternative splicing differences show post-transcriptional regulation divergence between these two species (e.g., PNPO). We also characterize 91 large-scale genomic differences between macaques and humans at a single-base-pair resolution and highlight their impact on gene regulation in primate evolution (e.g., FOLH1 and PIEZO2). Finally, population genetics recapitulates macaque speciation and selective sweeps, highlighting a potential genetic basis of reproduction and tail phenotype differences (e.g., STAB1, SEMA3F, and HOXD13). In summary, the integrated analysis of genetic variation and population genetics in macaques greatly enhances our comprehension of lineage-specific phenotypes, adaptation, and primate evolution, thereby improving their biomedical applications in human diseases.
3. Greshnova A, Pál K, Martinez JFI, Canzar S, Makova KD. Transcript Isoform Diversity of Y Chromosome Ampliconic Genes of Great Apes Uncovered Using Long Reads and Telomere-to-Telomere Reference Genome Assemblies. bioRxiv 2024:2024.04.02.587783. [PMID: 38617276 PMCID: PMC11014635 DOI: 10.1101/2024.04.02.587783] [Indexed: 04/16/2024]
Abstract
Y chromosomes of great apes harbor Ampliconic Genes (YAGs), multi-copy gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) that encode proteins important for spermatogenesis. Previous work assembled YAG transcripts based on their targeted sequencing but not using reference genome assemblies, potentially resulting in an incomplete transcript repertoire. Here we used the recently produced gapless telomere-to-telomere (T2T) Y chromosome assemblies of great ape species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan) and analyzed RNA data from whole-testis samples for the same species. We generated hybrid transcriptome assemblies by combining targeted long reads (Pacific Biosciences), untargeted long reads (Pacific Biosciences) and untargeted short reads (Illumina) and mapping them to the T2T reference genomes. Compared to the results from the reference-free approach, average transcript length was more than two times higher, and the total number of transcripts decreased threefold, improving the quality of the assembled transcriptome. The reference-based transcriptome assemblies allowed us to differentiate transcripts originating from different Y chromosome gene copies and from their non-Y chromosome homologs. We identified two sources of transcriptome diversity: alternative splicing and gene duplication with subsequent diversification of gene copies. For each gene family, we detected transcribed pseudogenes along with protein-coding gene copies. We revealed previously unannotated gene copies of YAGs as compared to currently available NCBI annotations, as well as novel isoforms for annotated gene copies. This analysis paves the way for better understanding Y chromosome gene functions, which is important given their role in spermatogenesis.
Affiliation(s)
- Aleksandra Greshnova
  - Department of Biology, Penn State University, University Park, PA, USA
  - Current address: Max Planck Institute for Evolutionary Biology, Plön, Germany
- Karol Pál
  - Department of Biology, Penn State University, University Park, PA, USA
- Juan Francisco Iturralde Martinez
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Stefan Canzar
  - Faculty of Informatics and Data Science, University of Regensburg, Regensburg, Germany
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Kateryna D Makova
  - Department of Biology, Penn State University, University Park, PA, USA
4. Plender EG, Prodanov T, Hsieh P, Nizamis E, Harvey WT, Sulovari A, Munson KM, Kaufman EJ, O'Neal WK, Valdmanis PN, Marschall T, Bloom JD, Eichler EE. Structural and genetic diversity in the secreted mucins, MUC5AC and MUC5B. bioRxiv 2024:2024.03.18.585560. [PMID: 38562829 PMCID: PMC10983947 DOI: 10.1101/2024.03.18.585560] [Indexed: 04/04/2024]
Abstract
The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.
Affiliation(s)
- Elizabeth G Plender
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
  - Basic Sciences Division and Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
- Timofey Prodanov
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Moorenstr. 5, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Moorenstr. 5, 40225 Düsseldorf, Germany
- PingHsun Hsieh
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Evangelos Nizamis
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA 98195, USA
- William T Harvey
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Arvis Sulovari
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Katherine M Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Eli J Kaufman
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA 98195, USA
- Wanda K O'Neal
  - Marsico Lung Institute/UNC CF Research Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, 27599, North Carolina, USA
- Paul N Valdmanis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
  - Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA 98195, USA
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, Moorenstr. 5, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, Moorenstr. 5, 40225 Düsseldorf, Germany
- Jesse D Bloom
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
  - Basic Sciences Division and Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
- Evan E Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
5. Guitart X, Porubsky D, Yoo D, Dougherty ML, Dishuck PC, Munson KM, Lewis AP, Hoekzema K, Knuth J, Chang S, Pastinen T, Eichler EE. Independent expansion, selection and hypervariability of the TBC1D3 gene family in humans. bioRxiv 2024:2024.03.12.584650. [PMID: 38654825 PMCID: PMC11037872 DOI: 10.1101/2024.03.12.584650] [Indexed: 04/26/2024]
Abstract
TBC1D3 is a primate-specific gene family that has expanded in the human lineage and has been implicated in neuronal progenitor proliferation and expansion of the frontal cortex. The gene family and its expression have been challenging to investigate because it is embedded in high-identity and highly variable segmental duplications. We sequenced and assembled the gene family using long-read sequencing data from 34 humans and 11 nonhuman primate species. Our analysis shows that this particular gene family has independently duplicated in at least five primate lineages, and the duplicated loci are enriched at sites of large-scale chromosomal rearrangements on chromosome 17. We find that most humans vary along two TBC1D3 clusters where human haplotypes are highly variable in copy number, differing by as many as 20 copies, and structure (structural heterozygosity 90%). We also show evidence of positive selection, as well as a significant change in the predicted human TBC1D3 protein sequence. Lastly, we find that, despite multiple duplications, human TBC1D3 expression is limited to a subset of copies and, most notably, from a single paralog group: TBC1D3-CDKL. These observations may help explain why a gene potentially important in cortical development can be so variable in the human population.
Affiliation(s)
- Xavi Guitart
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- David Porubsky
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- DongAhn Yoo
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Max L. Dougherty
  - Tisch Cancer Institute, Division of Hematology and Medical Oncology, The Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Philip C. Dishuck
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Katherine M. Munson
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Alexandra P. Lewis
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Kendra Hoekzema
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Jordan Knuth
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Stephen Chang
  - Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Medicine, Division of Cardiovascular Medicine, Stanford University, Stanford, CA, USA
- Tomi Pastinen
  - Department of Pediatrics, Genomic Medicine Center, Children’s Mercy Kansas City, Kansas City, MO, USA
  - Department of Pediatrics, School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA