1
Dishuck PC, Munson KM, Lewis AP, Dougherty ML, Underwood JG, Harvey WT, Hsieh P, Pastinen T, Eichler EE. Structural variation, selection, and diversification of the NPIP gene family from the human pangenome. bioRxiv 2025:2025.02.04.636496. [PMID: 39975192 PMCID: PMC11838601 DOI: 10.1101/2025.02.04.636496] [Indexed: 02/21/2025]
Abstract
The NPIP (nuclear pore interacting protein) gene family has expanded to high copy number in humans and African apes where it has been subject to an excess of amino acid replacement consistent with positive selection (1). Due to the limitations of short-read sequencing, NPIP human genetic diversity has been poorly understood. Using highly accurate assemblies generated from long-read sequencing as part of the human pangenome, we completely characterize 169 human haplotypes (4,665 NPIP paralogs and alleles). Of the 28 NPIP paralogs, just three (NPIPB2, B11, and B14) are fixed at a single copy, and only a single locus, B2, shows no structural variation. Four NPIP paralogs map to large segmental duplication blocks that mediate polymorphic inversions (355 kbp-1.6 Mbp) corresponding to microdeletions associated with developmental delay and autism. Haplotype-based tests of positive selection and selective sweeps identify two paralogs, B9 and B15, within the top percentile for both tests. Using full-length cDNA data from 101 tissue/cell types, we construct paralog-specific gene models and show that 56% (31/55 most abundant isoforms) have not been previously described in RefSeq. We define six distinct translation start sites and other protein structural features that distinguish paralogs, including a variable number tandem repeat that encodes a beta helix of variable size that emerged ∼3.1 million years ago in human evolution. Among the 28 NPIP paralogs, we identify distinct tissue and developmental patterns of expression with only a few maintaining the ancestral testis-enriched expression. A subset of paralogs (NPIPA1, A5, A6-9, B3-5, and B12/B13) show increased brain expression. Our results suggest ongoing positive selection in the human population and rapid diversification of NPIP gene models.
2
Wang M, Zhang S, Li R, Zhao Q. Unraveling the specialized metabolic pathways in medicinal plant genomes: a review. Front Plant Sci 2024; 15:1459533. [PMID: 39777086 PMCID: PMC11703845 DOI: 10.3389/fpls.2024.1459533] [Received: 07/04/2024] [Accepted: 12/04/2024] [Indexed: 01/11/2025]
Abstract
Medicinal plants are important sources of bioactive specialized metabolites with significant therapeutic potential. Advances in multi-omics have accelerated the understanding of specialized metabolite biosynthesis and regulation. Genomics, transcriptomics, proteomics, and metabolomics have each contributed new insights into biosynthetic gene clusters (BGCs), metabolic pathways, and stress responses. However, single-omics approaches often fail to fully address these complex processes. Integrated multi-omics provides a holistic perspective on key regulatory networks. High-throughput sequencing and emerging technologies like single-cell and spatial omics have deepened our understanding of cell-specific and spatially resolved biosynthetic dynamics. Despite these advancements, challenges remain in managing large datasets, standardizing protocols, accounting for the dynamic nature of specialized metabolism, and effectively applying synthetic biology for sustainable specialized metabolite production. This review highlights recent progress in omics-based research on medicinal plants, discusses available bioinformatics tools, and explores future research trends aimed at leveraging integrated multi-omics to improve the medicinal quality and sustainable utilization of plant resources.
Affiliation(s)
- Mingcheng Wang
  - Institute for Advanced Study, Chengdu University, Chengdu, China
  - Engineering Research Center of Sichuan-Tibet Traditional Medicinal Plant, Chengdu University, Chengdu, China
- Shuqiao Zhang
  - School of Food and Biological Engineering, Chengdu University, Chengdu, China
- Rui Li
  - Engineering Research Center of Sichuan-Tibet Traditional Medicinal Plant, Chengdu University, Chengdu, China
  - School of Food and Biological Engineering, Chengdu University, Chengdu, China
- Qi Zhao
  - Engineering Research Center of Sichuan-Tibet Traditional Medicinal Plant, Chengdu University, Chengdu, China
  - School of Food and Biological Engineering, Chengdu University, Chengdu, China
3
Kulmanov M, Tawfiq R, Liu Y, Al Ali H, Abdelhakim M, Alarawi M, Aldakhil H, Alhattab D, Alsolme EA, Althagafi A, Angelov A, Bougouffa S, Driguez P, Park C, Putra A, Reyes-Ramos AM, Hauser CAE, Cheung MS, Abedalthagafi MS, Hoehndorf R. A reference quality, fully annotated diploid genome from a Saudi individual. Sci Data 2024; 11:1278. [PMID: 39580486 PMCID: PMC11585617 DOI: 10.1038/s41597-024-04121-2] [Received: 06/28/2024] [Accepted: 11/11/2024] [Indexed: 11/25/2024] [Open access]
Abstract
We have used multiple sequencing approaches to sequence the genome of a volunteer from Saudi Arabia. We use the resulting data to generate a de novo assembly of the genome, and use different computational approaches to refine the assembly. As a consequence, we provide a contiguous assembly of the complete genome of an individual from Saudi Arabia for all chromosomes except chromosome Y, and label this assembly KSA001. We transferred genome annotations from reference genomes to fully annotate KSA001, and we make all primary sequencing data, the assembly, and the genome annotations freely available in public databases using the FAIR data principles. KSA001 is the first telomere-to-telomere-assembled genome from a Saudi individual that is freely available for any purpose.
Affiliation(s)
- Maxat Kulmanov
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Generative AI, King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Rund Tawfiq
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Generative AI, King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Yang Liu
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Generative AI, King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Hatoon Al Ali
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Marwa Abdelhakim
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Mohammed Alarawi
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Hind Aldakhil
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Dana Alhattab
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Laboratory for Nanomedicine, Biological and Environmental Science & Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Ebtehal A Alsolme
  - Genomic and Precision Medicine Department, King Fahad Medical City, Riyadh, Saudi Arabia
- Azza Althagafi
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Computer Science Department, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
- Angel Angelov
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Salim Bougouffa
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Patrick Driguez
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Changsook Park
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Alexander Putra
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Ana M Reyes-Ramos
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Charlotte A E Hauser
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Biological and Environmental Sciences & Engineering (BESE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Laboratory for Nanomedicine, Biological and Environmental Science & Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Max Planck Institute for Biology of Ageing, Cologne, Germany
  - Institute of Health Care Engineering with European Testing Center of Medical Devices, Graz University of Technology, Stremayrgasse 16/II, 8010, Graz, Austria
- Ming Sin Cheung
  - Core Labs, King Abdullah University of Science and Technology (KAUST), 4700 KAUST, 23955, Thuwal, Makkah, Saudi Arabia
- Malak S Abedalthagafi
  - Department of Pathology and Laboratory Medicine, Emory School of Medicine, Atlanta, GA, USA
  - King Salman Center for Disability Research, Riyadh, Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - KAUST Center of Excellence for Generative AI, King Abdullah University of Science and Technology, 4700 KAUST, 23955, Thuwal, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
4
Koren S, Bao Z, Guarracino A, Ou S, Goodwin S, Jenike KM, Lucas J, McNulty B, Park J, Rautiainen M, Rhie A, Roelofs D, Schneiders H, Vrijenhoek I, Nijbroek K, Nordesjo O, Nurk S, Vella M, Lawrence KR, Ware D, Schatz MC, Garrison E, Huang S, McCombie WR, Miga KH, Wittenberg AHJ, Phillippy AM. Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. Genome Res 2024; 34:1919-1930. [PMID: 39505490 PMCID: PMC11610574 DOI: 10.1101/gr.279334.124] [Received: 03/15/2024] [Accepted: 10/08/2024] [Indexed: 11/08/2024]
Abstract
The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
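The pairing of "99.999%" with "Q50" in the abstract above can be checked with a minimal sketch, assuming the standard Phred scaling Q = -10 log10(error rate). The function names are hypothetical illustrations and are not taken from the paper or from any assembly tool.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (0 < accuracy < 1) to a Phred-scaled quality value.

    Assumes the standard definition Q = -10 * log10(error rate).
    """
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def phred_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value back to a per-base error rate."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.999% accuracy corresponds to one expected error per 100,000 bases.
    print(round(accuracy_to_phred(0.99999), 3))
    print(phred_to_error_rate(50.0))
```

Under this scaling, each additional 10 units of Q corresponds to a tenfold lower error rate.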
Affiliation(s)
- Sergey Koren
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Zhigui Bao
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Baden-Württemberg, Germany
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Shujun Ou
  - Department of Molecular Genetics, Ohio State University, Columbus, Ohio 43210, USA
- Sara Goodwin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
- Katharine M Jenike
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Julian Lucas
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95060, USA
- Brandy McNulty
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95060, USA
- Jimin Park
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95060, USA
- Mikko Rautiainen
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Arang Rhie
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Olle Nordesjo
  - Oxford Nanopore Technologies, Oxford OX4 4DQ, United Kingdom
- Sergey Nurk
  - Oxford Nanopore Technologies, Oxford OX4 4DQ, United Kingdom
- Mike Vella
  - Oxford Nanopore Technologies, Oxford OX4 4DQ, United Kingdom
- Doreen Ware
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
  - USDA ARS NEA Plant, Soil and Nutrition Laboratory Research Unit, Ithaca, New York 14853, USA
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee 38163, USA
- Sanwen Huang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
  - State Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, Hainan 571101, China
- Karen H Miga
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95060, USA
- Adam M Phillippy
  - Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
5
Wang Z, Wang G, Li H, Jiang H, Sun Y, Han G, Ma J, Liu Q, Zhang C, Zhang D, Zhang H, Li Y, Tang B, Wang W. Chromosome-level assembly for the complex genome of land hermit crab Coenobita brevimanus. Sci Data 2024; 11:1190. [PMID: 39488506 PMCID: PMC11531507 DOI: 10.1038/s41597-024-04031-3] [Received: 05/17/2024] [Accepted: 10/23/2024] [Indexed: 11/04/2024] [Open access]
Abstract
Land hermit crabs are a group of shell-carrying crabs that have evolved remarkable terrestrial adaptations in behavior, morphology, physiology, and biochemistry. However, the genetic mechanisms underlying these adaptations remain unclear. In addition, it is usually very difficult to obtain good genome assemblies for crustaceans. In this study, we assembled the first chromosome-level genome for a land hermit crab (Coenobita brevimanus) with careful manual curation. The final assembly spans 4.74 Gb, with a contig N50 of 1.75 Mb and a scaffold N50 of 42.95 Mb, encompassing 117 chromosomes that account for 96.54% of the genome. Evaluations including genome BUSCO (95.26%), Merqury QV (35.88), and the mapping ratio of paired-end short reads (99.48%) showed the high continuity of the C. brevimanus genome assembly, making it the highest-quality genome among crustaceans with a genome size larger than 3 Gb. The availability of this chromosome-scale crustacean genome represents a valuable resource for the land hermit crab, which represents an independent water-to-land transition evolutionary event in the animal kingdom.
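The contig and scaffold N50 values quoted above are usually understood as the length L such that sequences of length at least L together cover at least half of the total assembly. The following is a minimal sketch under that assumed definition; the function name and the toy lengths are illustrative only and are not data from the paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the summed
    length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running total to avoid fractional halves.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # hypothetical lengths, total = 20
    print(n50(toy_contigs))
```

For the toy input, the two longest sequences (6 and 5) sum to 11, which is the first running total to reach half of 20, so the sketch reports 5.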
Affiliation(s)
- Zhongkai Wang
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Gang Wang
  - Jiangsu Key Laboratory for Bioresources of Saline Soils, Jiangsu Provincial Key Laboratory of Coastal Wetland Bioresources and Environmental Protection, School of Wetlands, Yancheng Teachers University, Yancheng, 224002, China
- Haorong Li
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Hui Jiang
  - College of Life Science, Hainan Normal University, Haikou, 571158, China
- Yishan Sun
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Ge Han
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Jinrui Ma
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Qiuning Liu
  - Jiangsu Key Laboratory for Bioresources of Saline Soils, Jiangsu Provincial Key Laboratory of Coastal Wetland Bioresources and Environmental Protection, School of Wetlands, Yancheng Teachers University, Yancheng, 224002, China
- Chen Zhang
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Daizhen Zhang
  - Jiangsu Key Laboratory for Bioresources of Saline Soils, Jiangsu Provincial Key Laboratory of Coastal Wetland Bioresources and Environmental Protection, School of Wetlands, Yancheng Teachers University, Yancheng, 224002, China
- Huabin Zhang
  - Jiangsu Key Laboratory for Bioresources of Saline Soils, Jiangsu Provincial Key Laboratory of Coastal Wetland Bioresources and Environmental Protection, School of Wetlands, Yancheng Teachers University, Yancheng, 224002, China
- Yongxin Li
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
- Boping Tang
  - Jiangsu Key Laboratory for Bioresources of Saline Soils, Jiangsu Provincial Key Laboratory of Coastal Wetland Bioresources and Environmental Protection, School of Wetlands, Yancheng Teachers University, Yancheng, 224002, China
- Wen Wang
  - School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, 710072, China
6
Chen Q, Yang C, Zhang G, Wu D. GCI: a continuity inspector for complete genome assembly. Bioinformatics 2024; 40:btae633. [PMID: 39432569 PMCID: PMC11550331 DOI: 10.1093/bioinformatics/btae633] [Received: 04/07/2024] [Revised: 10/08/2024] [Accepted: 10/18/2024] [Indexed: 10/23/2024] [Open access]
Abstract
MOTIVATION: Recent advances in long-read sequencing technologies have significantly facilitated the production of high-quality genome assemblies. The telomere-to-telomere (T2T) gapless assembly has become the new gold standard of genome assembly efforts. Several recent efforts have claimed to produce T2T-level reference genomes. However, a universal standard for qualifying a genome assembly as T2T is still missing. Traditional genome assembly assessment metrics (N50 and its derivatives) cannot differentiate between a nearly T2T assembly and a truly T2T assembly in continuity, either globally or locally. Additionally, these metrics are independent of raw reads, so they are easily inflated by artificial operations. Therefore, a gaplessness evaluation tool at single-nucleotide resolution to reflect true completeness is urgently needed in the era of complete genomes.
RESULTS: Here, we present a tool called Genome Continuity Inspector (GCI), designed to assess genome assembly continuity at single-base resolution, and evaluate how close an assembly is to the T2T level. GCI utilizes multiple aligners to map long reads from various sequencing platforms back to the assembly. By incorporating curated mapping coverage of high-confidence read alignments, GCI identifies potential assembly issues. Meanwhile, it provides GCI scores that quantify overall assembly continuity on the whole genome or chromosome scales.
AVAILABILITY AND IMPLEMENTATION: The open-source GCI code is freely available on GitHub (https://github.com/yeeus/GCI) under the MIT license.
Affiliation(s)
- Quanyu Chen
  - International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
  - Chu Kochen Honors College, Zhejiang University, Hangzhou 310058, China
- Chentao Yang
  - BGI Research, Shenzhen 518083, China
  - BGI Research, Wuhan 430074, China
- Guojie Zhang
  - International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu 322000, China
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
  - Women’s Hospital, School of Medicine, Zhejiang University, Hangzhou 310006, China
- Dongya Wu
  - Center for Evolutionary & Organismal Biology, Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China
7
Coulter T, Hill C, McKnight AJ. Insights into the length and breadth of methodologies harnessed to study human telomeres. Biomark Res 2024; 12:127. [PMID: 39438947 PMCID: PMC11515763 DOI: 10.1186/s40364-024-00668-9] [Received: 08/15/2024] [Accepted: 10/04/2024] [Indexed: 10/25/2024] [Open access]
Abstract
Telomeres are protective structures at the end of eukaryotic chromosomes that are strongly implicated in ageing and ill health. They undergo attrition with every cellular reproductive cycle. Evidence suggests that short telomeres trigger DNA damage responses that lead to cellular senescence. Accurate methods for measuring telomeres are required to fully investigate the roles that shortening telomeres play in the biology of disease and human ageing. The last two decades have brought forth several techniques that are used for measuring telomeres. This editorial highlights strengths and limitations of traditional and emerging techniques, guiding researchers to choose the most appropriate methodology for their research needs. These methods include Quantitative Polymerase Chain Reaction (qPCR), Omega qPCR (Ω-qPCR), Terminal Restriction Fragment analysis (TRF), Single Telomere Absolute-length Rapid (STAR) assays, Single TElomere Length Analysis (STELA), TElomere Shortest Length Assays (TESLA), Telomere Combing Assays (TCA), and Long-Read Telomere Sequencing. Challenges include replicating telomere measurement within and across cohorts, measuring the length of telomeres on individual chromosomes, and standardised reporting for publications. Areas of current and future focus have been highlighted, with recent methodological advancements, such as long-read sequencing, providing significant scope to study telomeres at an individual chromosome level.
Affiliation(s)
- Tiernan Coulter
  - Centre for Public Health, Queen's University Belfast, Institute of Clinical Sciences - Block A, Royal Victoria Hospital, Grosvenor Road, Belfast, BT12 6BJ, UK
- Claire Hill
  - Centre for Public Health, Queen's University Belfast, Institute of Clinical Sciences - Block A, Royal Victoria Hospital, Grosvenor Road, Belfast, BT12 6BJ, UK
- Amy Jayne McKnight
  - Centre for Public Health, Queen's University Belfast, Institute of Clinical Sciences - Block A, Royal Victoria Hospital, Grosvenor Road, Belfast, BT12 6BJ, UK
8
Mastoras M, Asri M, Brambrink L, Hebbar P, Kolesnikov A, Cook DE, Nattestad M, Lucas J, Won TS, Chang PC, Carroll A, Paten B, Shafin K. Highly accurate assembly polishing with DeepPolisher. bioRxiv 2024:2024.09.17.613505. [PMID: 39345401 PMCID: PMC11429912 DOI: 10.1101/2024.09.17.613505] [Indexed: 10/01/2024]
Abstract
Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
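The correspondence between a QV improvement of 3.4 and a 54% error reduction can be reproduced with a minimal sketch, assuming QV is Phred-scaled (QV = -10 log10 of the error rate), so that a gain of d units multiplies the error rate by 10^(-d/10). The function name is a hypothetical illustration and is not part of the DeepPolisher pipeline.

```python
def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Return the fractional reduction in error rate implied by a QV gain.

    Assumes Phred scaling: raising QV by delta_qv multiplies the error rate
    by 10 ** (-delta_qv / 10), so the removed fraction is one minus that factor.
    """
    remaining_fraction = 10.0 ** (-delta_qv / 10.0)
    return 1.0 - remaining_fraction


if __name__ == "__main__":
    # A gain of 3.4 QV units leaves about 46% of the original errors.
    print(f"{error_reduction_from_qv_gain(3.4):.0%}")
```

The calculation is independent of the starting QV: only the difference in QV determines the fraction of errors removed.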
Affiliation(s)
- Mira Mastoras
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Mobin Asri
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Prajna Hebbar
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Julian Lucas
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Taylor S. Won
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
9
Huang G, Bao Z, Feng L, Zhai J, Wendel JF, Cao X, Zhu Y. A telomere-to-telomere cotton genome assembly reveals centromere evolution and a Mutator transposon-linked module regulating embryo development. Nat Genet 2024; 56:1953-1963. [PMID: 39147922 DOI: 10.1038/s41588-024-01877-6] [Received: 09/20/2023] [Accepted: 07/18/2024] [Indexed: 08/17/2024]
Abstract
Assembly of complete genomes can reveal functional genetic elements missing from draft sequences. Here we present the near-complete telomere-to-telomere and contiguous genome of the cotton species Gossypium raimondii. Our assembly identified gaps and misoriented or misassembled regions in previous assemblies and produced 13 centromeres, with 25 chromosomal ends having telomeres. In contrast to satellite-rich Arabidopsis and rice centromeres, cotton centromeres lack phased CENH3 nucleosome positioning patterns and probably evolved by invasion from long terminal repeat retrotransposons. In-depth expression profiling of transposable elements revealed a previously unannotated DNA transposon (MuTC01) that interacts with miR2947 to produce trans-acting small interfering RNAs (siRNAs), one of which targets the newly evolved LEC2 (LEC2b) to produce phased siRNAs. Systematic genome editing experiments revealed that this tripartite module, miR2947-MuTC01-LEC2b, controls the morphogenesis of complex folded embryos characteristic of Gossypium and its close relatives in the cotton tribe. Our study reveals a trans-acting siRNA-based tripartite regulatory pathway for embryo development in higher plants.
Affiliation(s)
- Gai Huang
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, China.
- Institute for Advanced Studies, Wuhan University, Wuhan, China.
- Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China.
- Zhigui Bao
- Max Planck Institute for Biology Tübingen, Tübingen, Germany
- Li Feng
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Jixian Zhai
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Jonathan F Wendel
- Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA
- Xiaofeng Cao
- Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China
- Yuxian Zhu
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, China.
- Institute for Advanced Studies, Wuhan University, Wuhan, China.
- Hubei Hongshan Laboratory, Wuhan, China.
- Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China.
|
10
|
Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet 2024; 25:658-670. [PMID: 38649458 DOI: 10.1038/s41576-024-00718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2024] [Indexed: 04/25/2024]
Abstract
Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly - the process of reconstructing the genome sequence of an organism from sequencing reads - has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome - also known as telomere-to-telomere assembly - for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.
Affiliation(s)
- Heng Li
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Richard Durbin
- Department of Genetics, Cambridge University, Cambridge, UK.
|
11
|
Li M, Chen C, Wang H, Qin H, Hou S, Yang X, Jian J, Gao P, Liu M, Mu Z. Telomere-to-telomere genome assembly of sorghum. Sci Data 2024; 11:835. [PMID: 39095379 PMCID: PMC11297213 DOI: 10.1038/s41597-024-03664-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 07/19/2024] [Indexed: 08/04/2024] Open
Abstract
"Cuohu Bazi" (CHBZ) is an ancient sorghum variety collected from the fields of China, known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T) and gap-free genome assembly of CHBZ using PacBio HiFi reads, Oxford Nanopore Technologies reads, and Hi-C data. The assembled genome comprises 724.85 Mb, effectively resolving all 3,913 gaps that were present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures 10 centromeres and all 20 telomeres, providing strong support for their integrity. This assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%, k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences accounted for 70.41% of the genome, and a total of 32,855 protein-coding genes have been annotated. Furthermore, 161 CHBZ-specific presence/absence variant genes have been identified in comparison with the BTx623 genome. This study provides valuable insights for future research on sorghum genetics, genomics, and evolutionary history.
Affiliation(s)
- Meng Li
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
- Haigang Wang
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Huibin Qin
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Sen Hou
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Minxuan Liu
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China.
- Zhixin Mu
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
|
12
|
Taylor DJ, Eizenga JM, Li Q, Das A, Jenike KM, Kenny EE, Miga KH, Monlong J, McCoy RC, Paten B, Schatz MC. Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References. Annu Rev Genomics Hum Genet 2024; 25:77-104. [PMID: 38663087 PMCID: PMC11451085 DOI: 10.1146/annurev-genom-021623-081639] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/29/2024]
Abstract
The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.
Affiliation(s)
- Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
- Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California, USA
- Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Arun Das
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Katharine M Jenike
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Karen H Miga
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
- Jean Monlong
- Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France
- Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
- Benedict Paten
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
- Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
|
13
|
Zhang Y, Zhao M, Tan J, Huang M, Chu X, Li Y, Han X, Fang T, Tian Y, Jarret R, Lu D, Chen Y, Xue L, Li X, Qin G, Li B, Sun Y, Deng XW, Deng Y, Zhang X, He H. Telomere-to-telomere Citrullus super-pangenome provides direction for watermelon breeding. Nat Genet 2024; 56:1750-1761. [PMID: 38977857 PMCID: PMC11319210 DOI: 10.1038/s41588-024-01823-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2023] [Accepted: 06/04/2024] [Indexed: 07/10/2024]
Abstract
To decipher the genetic diversity within the cucurbit genus Citrullus, we generated telomere-to-telomere (T2T) assemblies of 27 distinct genotypes, encompassing all seven Citrullus species. This T2T super-pangenome has expanded the previously published reference genome, T2T-G42, by adding 399.2 Mb and 11,225 genes. Comparative analysis has unveiled gene variants and structural variations (SVs), shedding light on watermelon evolution and domestication processes that enhanced attributes such as bitterness and sugar content while compromising disease resistance. Multidisease-resistant loci from Citrullus amarus and Citrullus mucosospermus were successfully introduced into cultivated Citrullus lanatus. The SVs identified in C. lanatus have not only been inherited from cordophanus but also from C. mucosospermus, suggesting additional ancestors beyond cordophanus in the lineage of cultivated watermelon. Our investigation substantially improves the comprehension of watermelon genome diversity, furnishing comprehensive reference genomes for all Citrullus species. This advancement aids in the exploration and genetic enhancement of watermelon using its wild relatives.
Affiliation(s)
- Yilin Zhang
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- State Key Laboratory of Protein and Plant Gene Research, School of Advanced Agricultural Sciences and School of Life Sciences, Peking University, Beijing, China
- Mingxia Zhao
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Jingsheng Tan
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Minghan Huang
- State Key Laboratory of Protein and Plant Gene Research, School of Advanced Agricultural Sciences and School of Life Sciences, Peking University, Beijing, China
- Xiao Chu
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Yan Li
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Xue Han
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Taohong Fang
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Yao Tian
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Dongdong Lu
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Yijun Chen
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Lifang Xue
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Xiaoni Li
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Guochen Qin
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Bosheng Li
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- Yudong Sun
- Vegetable Research and Development Center, Huaiyin Institute of Agricultural Sciences of Xuhuai Region in Jiangsu, Huai'an, China
- Xing Wang Deng
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China
- State Key Laboratory of Protein and Plant Gene Research, School of Advanced Agricultural Sciences and School of Life Sciences, Peking University, Beijing, China
- Yun Deng
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China.
- Xingping Zhang
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China.
- Hang He
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, China.
- State Key Laboratory of Protein and Plant Gene Research, School of Advanced Agricultural Sciences and School of Life Sciences, Peking University, Beijing, China.
|
14
|
Kolesnikov A, Cook D, Nattestad M, Brambrink L, McNulty B, Gorzynski J, Goenka S, Ashley EA, Jain M, Miga KH, Paten B, Chang PC, Carroll A, Shafin K. Local read haplotagging enables accurate long-read small variant calling. Nat Commun 2024; 15:5907. [PMID: 39003259 PMCID: PMC11246426 DOI: 10.1038/s41467-024-50079-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 06/28/2024] [Indexed: 07/15/2024] Open
Abstract
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and enabled rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing platforms like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve the genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they get introduced. In this work, we have developed a local haplotype approximate method that enables state-of-the-art variant calling performance with multiple sequencing platforms including PacBio Revio system, ONT R10.4 simplex and duplex data. This addition of local haplotype approximation simplifies long-read variant calling with DeepVariant.
Affiliation(s)
- Daniel Cook
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Brandy McNulty
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Miten Jain
- Northeastern University, Boston, MA, USA
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
- Pi-Chuan Chang
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Andrew Carroll
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA.
- Kishwar Shafin
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA.
|
15
|
Ungar RA, Goddard PC, Jensen TD, Degalez F, Smith KS, Jin CA, Bonner DE, Bernstein JA, Wheeler MT, Montgomery SB. Impact of genome build on RNA-seq interpretation and diagnostics. Am J Hum Genet 2024; 111:1282-1300. [PMID: 38834072 PMCID: PMC11267525 DOI: 10.1016/j.ajhg.2024.05.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 05/04/2024] [Accepted: 05/06/2024] [Indexed: 06/06/2024] Open
Abstract
Transcriptomics is a powerful tool for unraveling the molecular effects of genetic variants and for disease diagnosis. Prior studies have demonstrated that choice of genome build impacts variant interpretation and diagnostic yield for genomic analyses. To identify the extent to which genome build also impacts transcriptomics analyses, we studied the effect of the hg19, hg38, and CHM13 genome builds on expression quantification and outlier detection in 386 rare disease and familial control samples from both the Undiagnosed Diseases Network and Genomics Research to Elucidate the Genetics of Rare Disease Consortium. Across six routinely collected biospecimens, 61% of quantified genes were not influenced by genome build. However, we identified 1,492 genes with build-dependent quantification, 3,377 genes with build-exclusive expression, and 9,077 genes with annotation-specific expression across six routinely collected biospecimens, including 566 clinically relevant and 512 known OMIM genes. Further, we demonstrate that between builds for a given gene, a larger difference in quantification is well correlated with a larger change in expression outlier calling. Combined, we provide a database of genes impacted by build choice and recommend that transcriptomics-guided analyses and diagnoses be cross-referenced with these data for robustness.
Affiliation(s)
- Rachel A Ungar
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
- Pagé C Goddard
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
- Tanner D Jensen
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
- Kevin S Smith
- Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
- Christopher A Jin
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
- Devon E Bonner
- Department of Pediatrics, School of Medicine, Stanford University, Stanford, CA, USA; Stanford Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
- Jonathan A Bernstein
- Stanford Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
- Matthew T Wheeler
- Department of Cardiovascular Medicine, School of Medicine, Stanford University, Stanford, CA, USA
- Stephen B Montgomery
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
|
16
|
Jia H, Tan S, Cai Y, Guo Y, Shen J, Zhang Y, Ma H, Zhang Q, Chen J, Qiao G, Ruan J, Zhang YE. Low-input PacBio sequencing generates high-quality individual fly genomes and characterizes mutational processes. Nat Commun 2024; 15:5644. [PMID: 38969648 PMCID: PMC11226609 DOI: 10.1038/s41467-024-49992-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 06/20/2024] [Indexed: 07/07/2024] Open
Abstract
Long-read sequencing, exemplified by PacBio, revolutionizes genomics, overcoming challenges like repetitive sequences. However, the high DNA requirement (>1 µg) is prohibitive for small organisms. We develop a low-input (100 ng), low-cost, and amplification-free library-generation method for PacBio sequencing (LILAP) using Tn5-based tagmentation and DNA circularization within one tube. We test LILAP with two Drosophila melanogaster individuals, and generate near-complete genomes, surpassing preexisting single-fly genomes. By analyzing variations in these two genomes, we characterize mutational processes: complex transpositions (transposon insertions together with extra duplications and/or deletions) prefer regions characterized by non-B DNA structures, and gene conversion of transposons occurs on both DNA and RNA levels. Concurrently, we generate two complete assemblies for the endosymbiotic bacterium Wolbachia in these flies and similarly detect transposon conversion. Thus, LILAP promises a broad PacBio sequencing adoption for not only mutational studies of flies and their symbionts but also explorations of other small organisms or precious samples.
Affiliation(s)
- Hangxing Jia
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
- Shengjun Tan
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
- Yingao Cai
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Yanyan Guo
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Jieyu Shen
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Yaqiong Zhang
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Huijing Ma
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Qingzhu Zhang
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Jinfeng Chen
- University of Chinese Academy of Sciences, Beijing, China
- State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Gexia Qiao
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Jue Ruan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China.
- Yong E Zhang
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
|
17
|
Wang H, Wang J, Chen C, Chen L, Li M, Qin H, Tian X, Hou S, Yang X, Jian J, Gao P, Wang L, Qiao Z, Mu Z. A complete reference genome of broomcorn millet. Sci Data 2024; 11:657. [PMID: 38906866 PMCID: PMC11192726 DOI: 10.1038/s41597-024-03489-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 06/06/2024] [Indexed: 06/23/2024] Open
Abstract
Broomcorn millet (Panicum miliaceum L.), known for its traits of drought resistance, adaptability to poor soil, short growth period, and high photosynthetic efficiency as a C4 plant, represents one of the earliest domesticated crops globally. This study reports the telomere-to-telomere (T2T) gap-free reference genome for broomcorn millet (AJ8) using PacBio high-fidelity (HiFi) long reads, Oxford Nanopore long-read technologies, and high-throughput chromosome conformation capture (Hi-C) sequencing data. The size of the AJ8 genome was approximately 834.7 Mb, anchored onto 18 pseudo-chromosomes. Notably, 18 centromeres and 36 telomeres were obtained. The assembled genome showed high quality in terms of completeness (BUSCO score: 99.6%, QV: 61.7, LAI value: 20.4). In addition, 63,678 protein-coding genes and 433.8 Mb (~52.0%) of repetitive sequences were identified. The complete reference genome for broomcorn millet provides a valuable resource for genetic studies and breeding of this important cereal crop.
Affiliation(s)
- Haigang Wang
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
- Junjie Wang
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Ling Chen
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Meng Li
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Huibin Qin
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Xiang Tian
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Sen Hou
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China
- Lun Wang
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
- Zhijun Qiao
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
- Zhixin Mu
- Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Key Laboratory of Crop Gene Resources and Germplasm Enhancement on Loess Plateau, Ministry of Agriculture and Rural Affairs, Taiyuan, 030031, China.
|
18
|
Makova KD, Pickett BD, Harris RS, Hartley GA, Cechova M, Pal K, Nurk S, Yoo D, Li Q, Hebbar P, McGrath BC, Antonacci F, Aubel M, Biddanda A, Borchers M, Bornberg-Bauer E, Bouffard GG, Brooks SY, Carbone L, Carrel L, Carroll A, Chang PC, Chin CS, Cook DE, Craig SJC, de Gennaro L, Diekhans M, Dutra A, Garcia GH, Grady PGS, Green RE, Haddad D, Hallast P, Harvey WT, Hickey G, Hillis DA, Hoyt SJ, Jeong H, Kamali K, Pond SLK, LaPolice TM, Lee C, Lewis AP, Loh YHE, Masterson P, McGarvey KM, McCoy RC, Medvedev P, Miga KH, Munson KM, Pak E, Paten B, Pinto BJ, Potapova T, Rhie A, Rocha JL, Ryabov F, Ryder OA, Sacco S, Shafin K, Shepelev VA, Slon V, Solar SJ, Storer JM, Sudmant PH, Sweetalana, Sweeten A, Tassia MG, Thibaud-Nissen F, Ventura M, Wilson MA, Young AC, Zeng H, Zhang X, Szpiech ZA, Huber CD, Gerton JL, Yi SV, Schatz MC, Alexandrov IA, Koren S, O'Neill RJ, Eichler EE, Phillippy AM. The complete sequence and comparative analysis of ape sex chromosomes. Nature 2024; 630:401-411. [PMID: 38811727 PMCID: PMC11168930 DOI: 10.1038/s41586-024-07473-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 04/26/2024] [Indexed: 05/31/2024]
Abstract
Apes possess two sex chromosomes: the male-specific Y chromosome and the X chromosome, which is present in both males and females. The Y chromosome is crucial for male reproduction, with deletions being linked to infertility [1]. The X chromosome is vital for reproduction and cognition [2]. Variation in mating patterns and brain function among apes suggests corresponding differences in their sex chromosomes. However, owing to their repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the methodology developed for the telomere-to-telomere (T2T) human genome, we produced gapless assemblies of the X and Y chromosomes for five great apes (bonobo (Pan paniscus), chimpanzee (Pan troglodytes), western lowland gorilla (Gorilla gorilla gorilla), Bornean orangutan (Pongo pygmaeus) and Sumatran orangutan (Pongo abelii)) and a lesser ape (the siamang gibbon (Symphalangus syndactylus)), and untangled the intricacies of their evolution. Compared with the X chromosomes, the ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements, owing to the accumulation of lineage-specific ampliconic regions, palindromes, transposable elements and satellites. Many Y chromosome genes expand in multi-copy families and some evolve under purifying selection. Thus, the Y chromosome exhibits dynamic evolution, whereas the X chromosome is more stable. Mapping short-read sequencing data to these assemblies revealed diversity and selection patterns on sex chromosomes of more than 100 individual great apes. These reference assemblies are expected to inform human evolution and conservation genetics of non-human apes, all of which are endangered species.
Affiliation(s)
| | - Brandon D Pickett
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Monika Cechova
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Karol Pal
- Penn State University, University Park, PA, USA
| | - Sergey Nurk
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - DongAhn Yoo
- University of Washington School of Medicine, Seattle, WA, USA
| | - Qiuhui Li
- Johns Hopkins University, Baltimore, MD, USA
| | - Prajna Hebbar
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | | | | | | | - Erich Bornberg-Bauer
- University of Münster, Münster, Germany
- MPI for Developmental Biology, Tübingen, Germany
| | - Gerard G Bouffard
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Shelise Y Brooks
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Lucia Carbone
- Oregon Health and Science University, Portland, OR, USA
- Oregon National Primate Research Center, Hillsboro, OR, USA
| | - Laura Carrel
- Penn State University School of Medicine, Hershey, PA, USA
| | | | | | - Chen-Shan Chin
- Foundation of Biological Data Sciences, Belmont, CA, USA
| | | | | | | | - Mark Diekhans
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Amalia Dutra
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Gage H Garcia
- University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Diana Haddad
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Pille Hallast
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | | | - Glenn Hickey
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - David A Hillis
- University of California Santa Barbara, Santa Barbara, CA, USA
| | | | - Hyeonsoo Jeong
- University of Washington School of Medicine, Seattle, WA, USA
| | | | | | | | - Charles Lee
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | | | - Yong-Hwee E Loh
- University of California Santa Barbara, Santa Barbara, CA, USA
| | - Patrick Masterson
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Kelly M McGarvey
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Karen H Miga
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | - Evgenia Pak
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Benedict Paten
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | - Arang Rhie
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Joana L Rocha
- University of California Berkeley, Berkeley, CA, USA
| | - Fedor Ryabov
- Masters Program, National Research University Higher School of Economics, Moscow, Russia
| | | | - Samuel Sacco
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | | | - Steven J Solar
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Sweetalana
- Penn State University, University Park, PA, USA
| | - Alex Sweeten
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Johns Hopkins University, Baltimore, MD, USA
| | | | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Mario Ventura
- Università degli Studi di Bari Aldo Moro, Bari, Italy
| | | | - Alice C Young
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Xinru Zhang
- Penn State University, University Park, PA, USA
| | | | | | | | - Soojin V Yi
- University of California Santa Barbara, Santa Barbara, CA, USA
| | | | | | - Sergey Koren
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Evan E Eichler
- University of Washington School of Medicine, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
| | - Adam M Phillippy
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
|
19
|
Huang J, Zhang Y, Li Y, Xing M, Lei C, Wang S, Nie Y, Wang Y, Zhao M, Han Z, Sun X, Zhou H, Wang Y, Zheng X, Xiao X, Fan W, Liu Z, Guo W, Zhang L, Cheng Y, Qian Q, He H, Yang Q, Qiao W. Haplotype-resolved gapless genome and chromosome segment substitution lines facilitate gene identification in wild rice. Nat Commun 2024; 15:4573. [PMID: 38811581 PMCID: PMC11137157 DOI: 10.1038/s41467-024-48845-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 05/15/2024] [Indexed: 05/31/2024]
Abstract
The abundant genetic variation harbored by wild rice (Oryza rufipogon) has provided a reservoir of useful genes for rice breeding. However, the genome of wild rice has not yet been comprehensively assessed. Here, we report the haplotype-resolved gapless genome assembly and annotation of wild rice Y476. In addition, we develop two sets of chromosome segment substitution lines (CSSLs) using Y476 as the donor parent and cultivated rice as the recurrent parents. By analyzing the gapless reference genome and CSSL population, we identify 254 QTLs associated with agronomic traits, biotic and abiotic stresses. We clone a receptor-like kinase gene associated with rice blast resistance and confirm its wild rice allele improves rice blast resistance. Collectively, our study provides a haplotype-resolved gapless reference genome and demonstrates a highly efficient platform for gene identification from wild rice.
Affiliation(s)
- Jingfen Huang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Yilin Zhang
- School of Advanced Agriculture Sciences and School of Life Sciences, State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing, China
- Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, Shandong, China
| | - Yapeng Li
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
- Hainan Academy of Agricultural Sciences, Haikou, Hainan, China
| | - Meng Xing
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Cailin Lei
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Shizhuang Wang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Yamin Nie
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Yanyan Wang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Mingchao Zhao
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
- Hainan Academy of Agricultural Sciences, Haikou, Hainan, China
| | - Zhenyun Han
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xianjun Sun
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Han Zhou
- School of Advanced Agriculture Sciences and School of Life Sciences, State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing, China
- Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, Shandong, China
| | - Yan Wang
- Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, Shandong, China
| | - Xiaoming Zheng
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Xiaorong Xiao
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
- Hainan Academy of Agricultural Sciences, Haikou, Hainan, China
| | - Weiya Fan
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Ziran Liu
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Wenlong Guo
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lifang Zhang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Yunlian Cheng
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Qian Qian
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China
| | - Hang He
- School of Advanced Agriculture Sciences and School of Life Sciences, State Key Laboratory of Protein and Plant Gene Research, Peking University, Beijing, China.
- Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agricultural Sciences at Weifang, Weifang, Shandong, China.
| | - Qingwen Yang
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China.
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China.
| | - Weihua Qiao
- State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China.
- National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences, Sanya, Hainan, China.
|
20
|
Hu J, Wang Z, Liang F, Liu SL, Ye K, Wang DP. NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads. GENOMICS, PROTEOMICS & BIOINFORMATICS 2024; 22:qzad009. [PMID: 38862426 DOI: 10.1093/gpbjnl/qzad009] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/10/2023] [Revised: 10/14/2023] [Accepted: 10/31/2023] [Indexed: 06/13/2024]
Abstract
The high-fidelity (HiFi) long-read sequencing technology developed by PacBio has greatly improved the base-level accuracy of genome assemblies. However, these assemblies still contain base-level errors, particularly within the error-prone regions of HiFi long reads. Existing genome polishing tools usually introduce overcorrections and haplotype switch errors when correcting errors in genomes assembled from HiFi long reads. Here, we describe an upgraded genome polishing tool, NextPolish2, which can fix base errors remaining in those "highly accurate" genomes assembled from HiFi long reads without introducing excessive overcorrections and haplotype switch errors. We believe that NextPolish2 is of great significance for further improving the accuracy of telomere-to-telomere (T2T) genomes. NextPolish2 is freely available at https://github.com/Nextomics/NextPolish2.
Affiliation(s)
- Jiang Hu
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- GrandOmics Biosciences, Beijing 102206, China
| | - Zhuo Wang
- GrandOmics Biosciences, Beijing 102206, China
| | - Fan Liang
- GrandOmics Biosciences, Beijing 102206, China
| | - Shan-Lin Liu
- Department of Entomology, College of Plant Protection, China Agricultural University, Beijing 100193, China
| | - Kai Ye
- School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
|
21
|
Zhang S, Xu N, Fu L, Yang X, Li Y, Yang Z, Feng Y, Ma K, Jiang X, Han J, Hu R, Zhang L, de Gennaro L, Ryabov F, Meng D, He Y, Wu D, Yang C, Paparella A, Mao Y, Bian X, Lu Y, Antonacci F, Ventura M, Shepelev VA, Miga KH, Alexandrov IA, Logsdon GA, Phillippy AM, Su B, Zhang G, Eichler EE, Lu Q, Shi Y, Sun Q, Mao Y. Comparative genomics of macaques and integrated insights into genetic variation and population history. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.07.588379. [PMID: 38645259 PMCID: PMC11030432 DOI: 10.1101/2024.04.07.588379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
The crab-eating macaques (Macaca fascicularis) and rhesus macaques (M. mulatta) are widely studied nonhuman primates in biomedical and evolutionary research. Despite their significance, the current understanding of the complex genomic structure in macaques and the differences between species requires substantial improvement. Here, we present a complete genome assembly of a crab-eating macaque and 20 haplotype-resolved macaque assemblies to investigate the complex regions and major genomic differences between species. Segmental duplication in macaques is ∼42% lower, while centromeres are ∼3.7 times longer than those in humans. The characterization of ∼2 Mbp fixed genetic variants and ∼240 Mbp complex loci highlights potential associations with metabolic differences between the two macaque species (e.g., CYP2C76 and EHBP1L1). Additionally, hundreds of alternative splicing differences show post-transcriptional regulation divergence between these two species (e.g., PNPO). We also characterize 91 large-scale genomic differences between macaques and humans at a single-base-pair resolution and highlight their impact on gene regulation in primate evolution (e.g., FOLH1 and PIEZO2). Finally, population genetics recapitulates macaque speciation and selective sweeps, highlighting potential genetic basis of reproduction and tail phenotype differences (e.g., STAB1, SEMA3F, and HOXD13). In summary, the integrated analysis of genetic variation and population genetics in macaques greatly enhances our comprehension of lineage-specific phenotypes, adaptation, and primate evolution, thereby improving their biomedical applications in human diseases.
|
22
|
Koren S, Bao Z, Guarracino A, Ou S, Goodwin S, Jenike KM, Lucas J, McNulty B, Park J, Rautiainen M, Rhie A, Roelofs D, Schneiders H, Vrijenhoek I, Nijbroek K, Ware D, Schatz MC, Garrison E, Huang S, McCombie WR, Miga KH, Wittenberg AHJ, Phillippy AM. Gapless assembly of complete human and plant chromosomes using only nanopore sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.15.585294. [PMID: 38529488 PMCID: PMC10962732 DOI: 10.1101/2024.03.15.585294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
The combination of ultra-long Oxford Nanopore (ONT) sequencing reads with long, accurate PacBio HiFi reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, "telomere-to-telomere" genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT "Duplex" sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely-studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used "Pore-C" chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the ultra-long reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and has the potential to provide a single-instrument solution for the reconstruction of complete genomes.
Affiliation(s)
- Sergey Koren
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Zhigui Bao
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Baden-Württemberg, Germany
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
- Human Technopole, Milan, Italy
| | - Shujun Ou
- Ohio State University, Columbus, OH, USA
| | - Sara Goodwin
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Katharine M Jenike
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Julian Lucas
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Brandy McNulty
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Jimin Park
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Arang Rhie
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Dick Roelofs
- KeyGene, Agro Business Park 90, 6708 PW Wageningen, Netherlands
| | | | - Ilse Vrijenhoek
- KeyGene, Agro Business Park 90, 6708 PW Wageningen, Netherlands
| | - Koen Nijbroek
- KeyGene, Agro Business Park 90, 6708 PW Wageningen, Netherlands
| | - Doreen Ware
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
| | - Sanwen Huang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- State Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, Hainan, China
| | | | - Karen H Miga
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | - Adam M Phillippy
- Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
|
23
|
Wang H, Chang TS, Dombroski BA, Cheng PL, Patil V, Valiente-Banuet L, Farrell K, Mclean C, Molina-Porcel L, Rajput A, De Deyn PP, Bastard NL, Gearing M, Kaat LD, Swieten JCV, Dopper E, Ghetti BF, Newell KL, Troakes C, de Yébenes JG, Rábano-Gutierrez A, Meller T, Oertel WH, Respondek G, Stamelou M, Arzberger T, Roeber S, Müller U, Hopfner F, Pastor P, Brice A, Durr A, Ber IL, Beach TG, Serrano GE, Hazrati LN, Litvan I, Rademakers R, Ross OA, Galasko D, Boxer AL, Miller BL, Seeley WW, Deerlin VMV, Lee EB, White CL, Morris H, de Silva R, Crary JF, Goate AM, Friedman JS, Leung YY, Coppola G, Naj AC, Wang LS, Dickson DW, Höglinger GU, Schellenberg GD, Geschwind DH, Lee WP. Whole-Genome Sequencing Analysis Reveals New Susceptibility Loci and Structural Variants Associated with Progressive Supranuclear Palsy. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2023.12.28.23300612. [PMID: 38234807 PMCID: PMC10793533 DOI: 10.1101/2023.12.28.23300612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Background: Progressive supranuclear palsy (PSP) is a rare neurodegenerative disease characterized by the accumulation of aggregated tau proteins in astrocytes, neurons, and oligodendrocytes. Previous genome-wide association studies for PSP were based on genotype arrays and were therefore inadequate for the analysis of rare variants as well as larger mutations, such as small insertions/deletions (indels) and structural variants (SVs). Method: In this study, we performed whole genome sequencing (WGS) and conducted association analysis for single nucleotide variants (SNVs), indels, and SVs in a cohort of 1,718 cases and 2,944 controls of European ancestry. Of the 1,718 PSP individuals, 1,441 were autopsy-confirmed and 277 were clinically diagnosed. Results: Our analysis of common SNVs and indels confirmed known genetic loci at MAPT, MOBP, STX6, SLCO1A2, DUSP10, and SP1, and further uncovered novel signals in APOE, FCHO1/MAP1S, KIF13A, TRIM24, TNXB, and ELOVL1. Notably, in contrast to Alzheimer's disease (AD), we observed the APOE ε2 allele to be the risk allele in PSP. Analysis of rare SNVs and indels identified significant association in ZNF592, and further gene network analysis identified a module of neuronal genes dysregulated in PSP. Moreover, seven common SVs associated with PSP were observed in the H1/H2 haplotype region (17q21.31) and other loci, including IGH, PCMT1, CYP2A13, and SMCP. In the H1/H2 haplotype region, there is a burden of rare deletions and duplications (P = 6.73 × 10⁻³) in PSP. Conclusions: Through WGS, we significantly enhanced our understanding of the genetic basis of PSP, providing new targets for exploring disease mechanisms and therapeutic interventions.
Affiliation(s)
- Hui Wang
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Timothy S Chang
- Movement Disorders Programs, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Beth A Dombroski
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Po-Liang Cheng
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Vishakha Patil
- Movement Disorders Programs, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Leopoldo Valiente-Banuet
- Movement Disorders Programs, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Kurt Farrell
- Department of Pathology, Department of Artificial Intelligence & Human Health, Nash Family Department of Neuroscience, Ronald M. Loeb Center for Alzheimer's Disease, Friedman Brain Institute, Neuropathology Brain Bank & Research CoRE, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Catriona Mclean
- Victorian Brain Bank, The Florey Institute of Neuroscience and Mental Health, Parkville, Victoria, Australia
| | - Laura Molina-Porcel
- Alzheimer's disease and other cognitive disorders unit. Neurology Service, Hospital Clínic, Fundació Recerca Clínic Barcelona (FRCB). Institut d'Investigacions Biomediques August Pi I Sunyer (IDIBAPS), University of Barcelona, Barcelona, Spain
- Neurological Tissue Bank of the Biobanc-Hospital Clínic-IDIBAPS, Barcelona, Spain
| | - Alex Rajput
- Movement Disorders Program, Division of Neurology, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| | - Peter Paul De Deyn
- Laboratory of Neurochemistry and Behavior, Experimental Neurobiology Unit, University of Antwerp, Wilrijk (Antwerp), Belgium
- Department of Neurology, University Medical Center Groningen, NL-9713 AV Groningen, Netherlands
| | | | - Marla Gearing
- Department of Pathology and Laboratory Medicine and Department of Neurology, Emory University School of Medicine, Atlanta, GA, USA
| | | | | | - Elise Dopper
- Netherlands Brain Bank and Erasmus University, Netherlands
| | - Bernardino F Ghetti
- Department of Pathology and Laboratory Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Kathy L Newell
- Department of Pathology and Laboratory Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Claire Troakes
- London Neurodegenerative Diseases Brain Bank, King's College London, London, UK
| | | | - Alberto Rábano-Gutierrez
- Fundación CIEN (Centro de Investigación de Enfermedades Neurológicas) - Centro Alzheimer Fundación Reina Sofía, Madrid, Spain
| | - Tina Meller
- Department of Neurology, Philipps-Universität, Marburg, Germany
| | | | - Gesine Respondek
- German Center for Neurodegenerative Diseases (DZNE), Munich, Germany
| | - Maria Stamelou
- Parkinson's disease and Movement Disorders Department, HYGEIA Hospital, Athens, Greece
- European University of Cyprus, Nicosia, Cyprus
| | - Thomas Arzberger
- Department of Psychiatry and Psychotherapy, University Hospital Munich, Ludwig-Maximilians-University Munich, Germany
- Center for Neuropathology and Prion Research, Ludwig-Maximilians-University Munich, Germany
| | | | | | - Franziska Hopfner
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Pau Pastor
- Unit of Neurodegenerative diseases, Department of Neurology, University Hospital Germans Trias i Pujol, Badalona, Barcelona, Spain
- Neurosciences, The Germans Trias i Pujol Research Institute (IGTP) Badalona, Badalona, Spain
| | - Alexis Brice
- Sorbonne Université, Paris Brain Institute - Institut du Cerveau - ICM, Inserm U1127, CNRS UMR 7225, APHP - Hôpital Pitié-Salpêtrière, Paris, France
| | - Alexandra Durr
- Sorbonne Université, Paris Brain Institute - Institut du Cerveau - ICM, Inserm U1127, CNRS UMR 7225, APHP - Hôpital Pitié-Salpêtrière, Paris, France
| | - Isabelle Le Ber
- Sorbonne Université, Paris Brain Institute - Institut du Cerveau - ICM, Inserm U1127, CNRS UMR 7225, APHP - Hôpital Pitié-Salpêtrière, Paris, France
| | | | | | | | - Irene Litvan
- Department of Neuroscience, University of California, San Diego, CA, USA
| | - Rosa Rademakers
- VIB Center for Molecular Neurology, University of Antwerp, Belgium
- Department of Neuroscience, Mayo Clinic Jacksonville, FL, USA
| | - Owen A Ross
- Department of Neuroscience, Mayo Clinic Jacksonville, FL, USA
| | - Douglas Galasko
- Department of Neuroscience, University of California, San Diego, CA, USA
| | - Adam L Boxer
- Memory and Aging Center, University of California, San Francisco, CA, USA
| | - Bruce L Miller
- Memory and Aging Center, University of California, San Francisco, CA, USA
| | - William W Seeley
- Memory and Aging Center, University of California, San Francisco, CA, USA
| | - Vivianna M Van Deerlin
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Edward B Lee
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Center for Neurodegenerative Disease Research, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
| | - Charles L White
- University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Huw Morris
- Department of Clinical and Movement Neuroscience, University College of London, London, UK
| | - Rohan de Silva
- Reta Lila Weston Institute, UCL Queen Square Institute of Neurology, London, UK
| | - John F Crary
- Department of Pathology, Department of Artificial Intelligence & Human Health, Nash Family Department of Neuroscience, Ronald M. Loeb Center for Alzheimer's Disease, Friedman Brain Institute, Neuropathology Brain Bank & Research CoRE, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Alison M Goate
- Department of Genetics and Genomic Sciences, New York, NY, USA; Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Jeffrey S Friedman
- Friedman Bioventure, Inc., Del Mar, CA, USA; Department of Genetics and Genomic Sciences, New York, NY, USA
| | - Yuk Yee Leung
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Giovanni Coppola
- Movement Disorders Programs, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Psychiatry, Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, CA, USA
| | - Adam C Naj
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Li-San Wang
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | | | - Günter U Höglinger
- Department of Neurology, LMU University Hospital, Ludwig-Maximilians-Universität (LMU) München; German Center for Neurodegenerative Diseases (DZNE), Munich, Germany; and Munich Cluster for Systems Neurology (SyNergy), Munich, Germany
| | - Gerard D Schellenberg
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Daniel H Geschwind
- Movement Disorders Programs, Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Institute of Precision Health, University of California, Los Angeles, Los Angeles, CA, USA
| | - Wan-Ping Lee
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Penn Neurodegeneration Genomics Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
|
24
|
Ungar RA, Goddard PC, Jensen TD, Degalez F, Smith KS, Jin CA, Bonner DE, Bernstein JA, Wheeler MT, Montgomery SB. Impact of genome build on RNA-seq interpretation and diagnostics. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.01.11.24301165. [PMID: 38260490 PMCID: PMC10802764 DOI: 10.1101/2024.01.11.24301165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Transcriptomics is a powerful tool for unraveling the molecular effects of genetic variants and for disease diagnosis. Prior studies have demonstrated that choice of genome build impacts variant interpretation and diagnostic yield for genomic analyses. To identify the extent to which genome build also impacts transcriptomics analyses, we studied the effect of the hg19, hg38, and CHM13 genome builds on expression quantification and outlier detection in 386 rare disease and familial control samples from both the Undiagnosed Diseases Network (UDN) and Genomics Research to Elucidate the Genetics of Rare Disease (GREGoR) Consortium. We identified 2,800 genes with build-dependent quantification across six routinely collected biospecimens, including 1,391 protein-coding genes and 341 known rare disease genes. We further observed multiple genes that only have detectable expression in a subset of genome builds. Finally, we characterized how genome build impacts the detection of outlier transcriptomic events. Combined, we provide a database of genes impacted by build choice, and recommend that transcriptomics-guided analyses and diagnoses are cross-referenced with these data for robustness.
Affiliation(s)
- Rachel A. Ungar
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | - Pagé C. Goddard
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | - Tanner D. Jensen
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | | | - Kevin S. Smith
- Department of Pathology, School of Medicine, Stanford University
| | | | | | - Devon E. Bonner
- Department of Pediatrics, School of Medicine, Stanford University
- Stanford Center for Undiagnosed Diseases, Stanford University
| | | | - Matthew T. Wheeler
- Department of Cardiovascular Medicine, School of Medicine, Stanford University
| | - Stephen B. Montgomery
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
- Department of Biomedical Data Science, Stanford University
|
25
|
Bett VK, Macon A, Vicoso B, Elkrewi M. Chromosome-Level Assembly of Artemia franciscana Sheds Light on Sex Chromosome Differentiation. Genome Biol Evol 2024; 16:evae006. [PMID: 38245839 PMCID: PMC10827361 DOI: 10.1093/gbe/evae006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 11/27/2023] [Accepted: 12/21/2023] [Indexed: 01/22/2024] Open
Abstract
Since the commercialization of brine shrimp (genus Artemia) in the 1950s, this lineage, and in particular the model species Artemia franciscana, has been the subject of extensive research. However, our understanding of the genetic mechanisms underlying various aspects of their reproductive biology, including sex determination, is still lacking. This is partly due to the scarcity of genomic resources for Artemia species and crustaceans in general. Here, we present a chromosome-level genome assembly of A. franciscana (Kellogg 1906), from the Great Salt Lake, United States. The genome is 1 Gb, and the majority of the genome (81%) is scaffolded into 21 linkage groups using a previously published high-density linkage map. We performed coverage and FST analyses using male and female genomic and transcriptomic reads to quantify the extent of differentiation between the Z and W chromosomes. Additionally, we quantified the expression levels in male and female heads and gonads and found further evidence for dosage compensation in this species.
Affiliation(s)
| | - Ariana Macon
- Institute of Science and Technology Austria (ISTA), Klosterneuburg 3400, Austria
| | - Beatriz Vicoso
- Institute of Science and Technology Austria (ISTA), Klosterneuburg 3400, Austria
| | - Marwan Elkrewi
- Institute of Science and Technology Austria (ISTA), Klosterneuburg 3400, Austria
|
26
|
Silva JM, Pinho AJ, Pratas D. AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data. Gigascience 2024; 13:giae086. [PMID: 39589438 PMCID: PMC11590114 DOI: 10.1093/gigascience/giae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 06/18/2024] [Accepted: 10/14/2024] [Indexed: 11/27/2024] Open
Abstract
BACKGROUND: Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing millions of viral genomes in multi-FASTA format is computationally demanding, especially when using alignment-based methods. Most existing methods are not designed to handle such large datasets, often requiring the analysis to be divided into smaller parts to obtain results using available computational resources.
FINDINGS: We introduce AltaiR, a toolkit for analyzing multiple sequences in multi-FASTA format using exclusively alignment-free methodologies. AltaiR enables the identification of singularity and similarity patterns within sequences and computes static and temporal dynamics without restrictions on the number or size of input sequences. It automatically filters low-quality, biased, or deviant data. We demonstrate AltaiR's capabilities by analyzing more than 1.5 million full severe acute respiratory syndrome coronavirus 2 sequences, revealing interesting observations regarding viral genome characteristics over time, such as shifts in nucleotide composition, decreases in average Kolmogorov sequence complexity, and the evolution of the smallest sequences not found in the human host.
CONCLUSIONS: AltaiR can identify temporal characteristics and trends in large numbers of sequences, making it ideal for scenarios involving endemic or epidemic outbreaks with vast amounts of available sequence data. Implemented in C with multithreading and methodological optimizations, AltaiR is computationally efficient, flexible, and dependency-free. It accepts any sequence in FASTA format, including amino acid sequences. The complete toolkit is freely available at https://github.com/cobilab/altair.
Affiliation(s)
- Jorge M Silva
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
| | - Armando J Pinho
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
| | - Diogo Pratas
- IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal
- DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
- DoV, Department of Virology, University of Helsinki, Helsinki, Finland
|
27
|
Harvey WT, Ebert P, Ebler J, Audano PA, Munson KM, Hoekzema K, Porubsky D, Beck CR, Marschall T, Garimella K, Eichler EE. Whole-genome long-read sequencing downsampling and its effect on variant-calling precision and recall. Genome Res 2023; 33:2029-2040. [PMID: 38190646 PMCID: PMC10760522 DOI: 10.1101/gr.278070.123] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/04/2023] [Accepted: 11/03/2023] [Indexed: 01/10/2024]
Abstract
Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy, and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets, with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
Affiliation(s)
- William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Peter A Audano
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, USA
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA
| | - Christine R Beck
- The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut 06032, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, Connecticut 06030-6403, USA
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
| | - Kiran Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195-5065, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
|
28
|
Makova KD, Pickett BD, Harris RS, Hartley GA, Cechova M, Pal K, Nurk S, Yoo D, Li Q, Hebbar P, McGrath BC, Antonacci F, Aubel M, Biddanda A, Borchers M, Bomberg E, Bouffard GG, Brooks SY, Carbone L, Carrel L, Carroll A, Chang PC, Chin CS, Cook DE, Craig SJ, de Gennaro L, Diekhans M, Dutra A, Garcia GH, Grady PG, Green RE, Haddad D, Hallast P, Harvey WT, Hickey G, Hillis DA, Hoyt SJ, Jeong H, Kamali K, Kosakovsky Pond SL, LaPolice TM, Lee C, Lewis AP, Loh YHE, Masterson P, McCoy RC, Medvedev P, Miga KH, Munson KM, Pak E, Paten B, Pinto BJ, Potapova T, Rhie A, Rocha JL, Ryabov F, Ryder OA, Sacco S, Shafin K, Shepelev VA, Slon V, Solar SJ, Storer JM, Sudmant PH, Sweetalana, Sweeten A, Tassia MG, Thibaud-Nissen F, Ventura M, Wilson MA, Young AC, Zeng H, Zhang X, Szpiech ZA, Huber CD, Gerton JL, Yi SV, Schatz MC, Alexandrov IA, Koren S, O’Neill RJ, Eichler E, Phillippy AM. The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.30.569198. [PMID: 38077089 PMCID: PMC10705393 DOI: 10.1101/2023.11.30.569198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2023]
Abstract
Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution.
These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.
Affiliation(s)
| | - Brandon D. Pickett
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Monika Cechova
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Karol Pal
- Penn State University, University Park, PA, USA
| | - Sergey Nurk
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - DongAhn Yoo
- University of Washington School of Medicine, Seattle, WA, USA
| | - Qiuhui Li
- Johns Hopkins University, Baltimore, MD, USA
| | - Prajna Hebbar
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | | | | | | | - Erich Bomberg
- University of Münster, Münster, Germany
- MPI for Developmental Biology, Tübingen, Germany
| | - Gerard G. Bouffard
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Shelise Y. Brooks
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Lucia Carbone
- Oregon Health & Science University, Portland, OR, USA
- Oregon National Primate Research Center, Hillsboro, OR, USA
| | - Laura Carrel
- Penn State University School of Medicine, Hershey, PA, USA
| | | | | | - Chen-Shan Chin
- Foundation of Biological Data Sciences, Belmont, CA, USA
| | | | | | | | - Mark Diekhans
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - Amalia Dutra
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Gage H. Garcia
- University of Washington School of Medicine, Seattle, WA, USA
| | | | | | - Diana Haddad
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Pille Hallast
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | | | - Glenn Hickey
- University of California Santa Cruz, Santa Cruz, CA, USA
| | - David A. Hillis
- University of California Santa Barbara, Santa Barbara, CA, USA
| | | | - Hyeonsoo Jeong
- University of Washington School of Medicine, Seattle, WA, USA
| | | | | | | | - Charles Lee
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | | | | | - Patrick Masterson
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Karen H. Miga
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | - Evgenia Pak
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Benedict Paten
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | - Arang Rhie
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | | | - Samuel Sacco
- University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | | | - Steven J. Solar
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Sweetalana
- Penn State University, University Park, PA, USA
| | - Alex Sweeten
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Johns Hopkins University, Baltimore, MD, USA
| | | | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | | | - Alice C. Young
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Xinru Zhang
- Penn State University, University Park, PA, USA
| | | | | | | | - Soojin V. Yi
- University of California Santa Barbara, Santa Barbara, CA, USA
| | | | | | - Sergey Koren
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | | | - Evan Eichler
- University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Adam M. Phillippy
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
|
29
|
Ferrer A, Stephens ZD, Kocher JPA. Experimental and Computational Approaches to Measure Telomere Length: Recent Advances and Future Directions. Curr Hematol Malig Rep 2023; 18:284-291. [PMID: 37947937 PMCID: PMC10709248 DOI: 10.1007/s11899-023-00717-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 10/30/2023] [Indexed: 11/12/2023]
Abstract
PURPOSE OF REVIEW The length of telomeres, protective structures at the chromosome ends, is a well-established biomarker for pathological conditions including multisystemic syndromes called telomere biology disorders. Approaches to measure telomere length (TL) differ on whether they estimate average, distribution, or chromosome-specific TL, and each presents their own advantages and limitations. RECENT FINDINGS The development of long-read sequencing and publication of the telomere-to-telomere human genome reference has allowed for scalable and high-resolution TL estimation in pre-existing sequencing datasets but is still impractical as a dedicated TL test. As sequencing costs continue to fall and strategies for selectively enriching telomere regions prior to sequencing improve, these approaches may become a promising alternative to classic methods. Measurement methods rely on probe hybridization, qPCR or more recently, computational methods using sequencing data. Refinements of existing techniques and new approaches have been recently developed but a test that is accurate, simple, and scalable is still lacking.
Affiliation(s)
- Alejandro Ferrer
- Division of Hematology, Mayo Clinic, Rochester, 200 First Street SW, Rochester, MN, USA.
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA.
|
30
|
He Y, Chu Y, Guo S, Hu J, Li R, Zheng Y, Ma X, Du Z, Zhao L, Yu W, Xue J, Bian W, Yang F, Chen X, Zhang P, Wu R, Ma Y, Shao C, Chen J, Wang J, Li J, Wu J, Hu X, Long Q, Jiang M, Ye H, Song S, Li G, Wei Y, Xu Y, Ma Y, Chen Y, Wang K, Bao J, Xi W, Wang F, Ni W, Zhang M, Yu Y, Li S, Kang Y, Gao Z. T2T-YAO: A Telomere-to-telomere Assembled Diploid Reference Genome for Han Chinese. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:1085-1100. [PMID: 37595788 PMCID: PMC11082261 DOI: 10.1016/j.gpb.2023.08.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/18/2023] [Revised: 08/01/2023] [Accepted: 08/08/2023] [Indexed: 08/20/2023]
Abstract
Since its initial release in 2001, the human reference genome has undergone continuous improvement in quality, and the recently released telomere-to-telomere (T2T) version - T2T-CHM13 - reaches its highest level of continuity and accuracy after 20 years of effort by working on a simplified, nearly homozygous genome of a hydatidiform mole cell line. Here, to provide an authentic complete diploid human genome reference for the Han Chinese, the largest population in the world, we assembled the genome of a male Han Chinese individual, T2T-YAO, which includes T2T assemblies of all the 22 + X + M and 22 + Y chromosomes in both haploids. The quality of T2T-YAO is much better than those of all currently available diploid assemblies, and its haploid version, T2T-YAO-hp, generated by selecting the better assembly for each autosome, reaches the top quality of fewer than one error per 29.5 Mb, even higher than that of T2T-CHM13. Derived from an individual living in the aboriginal region of the Han population, T2T-YAO shows clear ancestry and potential genetic continuity from the ancient ancestors. Each haplotype of T2T-YAO possesses ∼ 330-Mb exclusive sequences, ∼ 3100 unique genes, and tens of thousands of nucleotide and structural variations as compared with CHM13, highlighting the necessity of a population-stratified reference genome. The construction of T2T-YAO, an accurate and authentic representative of the Chinese population, would enable precise delineation of genomic variations and advance our understandings in the hereditability of diseases and phenotypes, especially within the context of the unique variations of the Chinese population.
Affiliation(s)
- Yukun He
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China
| | - Yanan Chu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Shuming Guo
- Linfen Clinical Medicine Research Center, Linfen 041000, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China
| | - Jiang Hu
- GrandOmics Biosciences Co., Ltd, Wuhan 430076, China
| | - Ran Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yali Zheng
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Xinqian Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Zhenglin Du
- Institute of PSI Genomics, Wenzhou 325024, China
| | - Lili Zhao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wenyi Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Jianbo Xue
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wenjie Bian
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Feifei Yang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Xi Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Pingan Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Rihan Wu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yifan Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Changjun Shao
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jing Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jian Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Jiwei Li
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Jing Wu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Xiaoyi Hu
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Qiuyue Long
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Mingzheng Jiang
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Hongli Ye
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Shixu Song
- Department of Respiratory, Critical Care and Sleep Medicine, Xiang'an Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen 361101, China
| | - Guangyao Li
- Linfen Clinical Medicine Research Center, Linfen 041000, China
| | - Yue Wei
- Linfen Clinical Medicine Research Center, Linfen 041000, China
| | - Yu Xu
- Beijing Jishuitan Hospital, Capital Medical University, Beijing 100035, China
| | - Yanliang Ma
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yanwen Chen
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Keqiang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Jing Bao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wen Xi
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Fang Wang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Wentao Ni
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Moqin Zhang
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yan Yu
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Shengnan Li
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China
| | - Yu Kang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Zhancheng Gao
- Department of Respiratory and Critical Care Medicine, Peking University People's Hospital, Beijing 100044, China; Institute of Chest and Lung Diseases, Shanxi Medical University, Taiyuan 030001, China; Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing 100101, China.
| |
|
31
|
Delorean EE, Youngblood RC, Simpson SA, Schoonmaker AN, Scheffler BE, Rutter WB, Hulse-Kemp AM. Representing true plant genomes: haplotype-resolved hybrid pepper genome with trio-binning. Frontiers in Plant Science 2023; 14:1184112. [PMID: 38034563 PMCID: PMC10687446 DOI: 10.3389/fpls.2023.1184112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023]
Abstract
As sequencing costs decrease and availability of high-fidelity long-read sequencing increases, generating experiment-specific de novo genome assemblies becomes feasible. In many crop species, obtaining the genome of a hybrid or heterozygous individual is necessary for systems that do not tolerate inbreeding or for investigating important biological questions, such as hybrid vigor. However, most genome assembly methods that have been used in plants result in a merged single-sequence representation that is not a biologically accurate representation of either haplotype within a diploid individual. The resulting genome assembly is often fragmented and exhibits a mosaic of the two haplotypes, referred to as haplotype-switching. Important haplotype-level information, such as causal mutations and structural variation, is therefore lost, causing difficulties in interpreting downstream analyses. To overcome this challenge, we have applied a method developed for animal genome assembly called trio-binning to an intra-specific hybrid of chili pepper (Capsicum annuum L. cv. HDA149 × Capsicum annuum L. cv. HDA330). We tested all currently available software for performing trio-binning, combined with multiple scaffolding technologies including Bionano, to determine the optimal method of producing the best haplotype-resolved assembly. Ultimately, we produced highly contiguous, biologically true haplotype-resolved genome assemblies for each parent, with scaffold N50s of 266.0 Mb and 281.3 Mb, with 99.6% and 99.8% positioned into chromosomes, respectively. The assemblies captured 3.10 Gb and 3.12 Gb of the estimated 3.5 Gb chili pepper genome size. These assemblies represent the complete genome structure of the intraspecific hybrid, as well as the two parental genomes, and show measurable improvements over the currently available reference genomes. Our manuscript provides a valuable guide on how to apply trio-binning to other plant genomes.
Affiliation(s)
- Emily E. Delorean
- Genomics and Bioinformatics Research Unit, USDA-ARS, Raleigh, NC, United States
- Crop and Soil Sciences Department, North Carolina State University, Raleigh, NC, United States
| | - Ramey C. Youngblood
- Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Starkville, MS, United States
| | - Sheron A. Simpson
- Genomics and Bioinformatics Research Unit, United States Department of Agriculture - Agriculture Research Service (USDA-ARS), Stoneville, MS, United States
| | - Ashley N. Schoonmaker
- Crop and Soil Sciences Department, North Carolina State University, Raleigh, NC, United States
| | - Brian E. Scheffler
- Genomics and Bioinformatics Research Unit, United States Department of Agriculture - Agriculture Research Service (USDA-ARS), Stoneville, MS, United States
| | - William B. Rutter
- US Vegetable Laboratory, United States Department of Agriculture - Agriculture Research Service (USDA-ARS), Charleston, SC, United States
| | - Amanda M. Hulse-Kemp
- Genomics and Bioinformatics Research Unit, USDA-ARS, Raleigh, NC, United States
- Crop and Soil Sciences Department, North Carolina State University, Raleigh, NC, United States
| |
|
32
|
Zeng T, He Z, He J, Lv W, Huang S, Li J, Zhu L, Wan S, Zhou W, Yang Z, Zhang Y, Luo C, He J, Wang C, Wang L. The telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Horticulture Research 2023; 10:uhad209. [PMID: 38023474 PMCID: PMC10681038 DOI: 10.1093/hr/uhad209] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/30/2023] [Revised: 10/19/2023] [Indexed: 12/01/2023]
Abstract
Vaccinium duclouxii, endemic to southwestern China, is a berry-producing shrub or small tree belonging to the Ericaceae family, with high nutritive, medicinal, and ornamental value, abundant germplasm resources, and good edible properties. In addition, V. duclouxii exhibits strong tolerance to adverse environmental conditions, making it a promising candidate for research and offering wide-ranging possibilities for utilization. However, the lack of a V. duclouxii genome sequence has hampered its development and utilization. Here, a high-quality telomere-to-telomere genome sequence of V. duclouxii was de novo assembled and annotated. All 12 chromosomes were assembled into gap-free single contigs, providing the highest integrity and quality assembly reported so far for blueberry. The V. duclouxii genome is 573.67 Mb and encodes 41 953 protein-coding genes. Combining transcriptomics and metabolomics analyses, we have uncovered the molecular mechanisms involved in sugar and acid accumulation and anthocyanin biosynthesis in V. duclouxii. This provides essential molecular information for further research on the quality of V. duclouxii. Moreover, the high-quality telomere-to-telomere assembly of the V. duclouxii genome will provide insights into the genomic evolution of Vaccinium and support advancements in blueberry genetics and molecular breeding.
Affiliation(s)
- Tuo Zeng
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Zhijiao He
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
| | - Jiefang He
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Wei Lv
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Shixiang Huang
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Jiawen Li
- School of Advanced Agricultural Sciences, Peking University, 100871 Beijing, China
| | - Liyong Zhu
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
| | - Shuang Wan
- Wuhan Benagen Technology Co., Ltd, Wuhan 430070, China
| | - Wanfei Zhou
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
| | - Zhengsong Yang
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
| | - Yatao Zhang
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Chong Luo
- School of Life Sciences, Guizhou Normal University, Guiyang 550000, China
| | - Jiawei He
- Institute of Alpine Economic Plant, Yunnan Academy of Agricultural Sciences, Lijiang 674199, Yunnan, China
| | - Caiyun Wang
- National Key Laboratory for Germplasm Innovation & Utilization of Horticultural Crops, College of Horticulture & Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
| | - Liangsheng Wang
- Key Laboratory of Plant Resources, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
- China National Botanical Garden, Beijing 100093, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| |
|
33
|
Schelkunov MI. Mabs, a suite of tools for gene-informed genome assembly. BMC Bioinformatics 2023; 24:377. [PMID: 37794322 PMCID: PMC10548655 DOI: 10.1186/s12859-023-05499-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/04/2023] [Accepted: 09/26/2023] [Indexed: 10/06/2023]
Abstract
BACKGROUND: Despite constantly improving genome sequencing methods, error-free eukaryotic genome assembly has not yet been achieved. Among other kinds of problems of eukaryotic genome assembly are so-called "haplotypic duplications", which may manifest themselves as cases of alleles being mistakenly assembled as paralogues. Haplotypic duplications are dangerous because they create illusions of gene family expansions and, thus, may lead scientists to incorrect conclusions about genome evolution and functioning. RESULTS: Here, I present Mabs, a suite of tools that serve as parameter optimizers of the popular genome assemblers Hifiasm and Flye. By optimizing the parameters of Hifiasm and Flye, Mabs tries to create genome assemblies with the genes assembled as accurately as possible. Tests on 6 eukaryotic genomes showed that in 6 out of 6 cases, Mabs created assemblies with more accurately assembled genes than those generated by Hifiasm and Flye when they were run with default parameters. When assemblies of Mabs, Hifiasm and Flye were postprocessed by a popular tool for haplotypic duplication removal, Purge_dups, genes were better assembled by Mabs in 5 out of 6 cases. CONCLUSIONS: Mabs is useful for making high-quality genome assemblies. It is available at https://github.com/shelkmike/Mabs.
|
34
|
Rautiainen M, Nurk S, Walenz BP, Logsdon GA, Porubsky D, Rhie A, Eichler EE, Phillippy AM, Koren S. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol 2023; 41:1474-1482. [PMID: 36797493 PMCID: PMC10427740 DOI: 10.1038/s41587-023-01662-6] [Citation(s) in RCA: 134] [Impact Index Per Article: 67.0] [Received: 06/24/2022] [Accepted: 01/03/2023] [Indexed: 02/18/2023]
Abstract
The Telomere-to-Telomere consortium recently assembled the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio high-fidelity reads. We have improved and automated this strategy in Verkko, an iterative, graph-based pipeline for assembling complete, diploid genomes. Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and progressively simplifies this graph by integrating ultra-long reads and haplotype-specific markers. The result is a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere. Running Verkko on the HG002 human genome resulted in 20 of 46 diploid chromosomes assembled without gaps at 99.9997% accuracy. The complete assembly of diploid genomes is a critical step towards the construction of comprehensive pangenome databases and chromosome-scale comparative genomics.
Affiliation(s)
- Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Oxford Nanopore Technologies, Oxford, UK
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| |
|
35
|
Yang C, Zhou Y, Song Y, Wu D, Zeng Y, Nie L, Liu P, Zhang S, Chen G, Xu J, Zhou H, Zhou L, Qian X, Liu C, Tan S, Zhou C, Dai W, Xu M, Qi Y, Wang X, Guo L, Fan G, Wang A, Deng Y, Zhang Y, Jin J, He Y, Guo C, Guo G, Zhou Q, Xu X, Yang H, Wang J, Xu S, Mao Y, Jin X, Ruan J, Zhang G. The complete and fully-phased diploid genome of a male Han Chinese. Cell Res 2023; 33:745-761. [PMID: 37452091 PMCID: PMC10542383 DOI: 10.1038/s41422-023-00849-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 05/06/2023] [Accepted: 06/29/2023] [Indexed: 07/18/2023]
Abstract
Since the release of the complete human genome, the priority of human genomic study has been shifting towards closing gaps in ethnic diversity. Here, we present a fully phased and well-annotated diploid human genome from a Han Chinese male individual (CN1), in which the assemblies of both haplotypes achieve the telomere-to-telomere (T2T) level. Comparison of this diploid genome with the CHM13 haploid T2T genome revealed significant variations in the centromere. Outside the centromere, we discovered 11,413 structural variations, including numerous novel ones. We also detected thousands of CN1 alleles that have accumulated high substitution rates and a few that have been under positive selection in the East Asian population. Further, we found that CN1 outperforms CHM13 as a reference genome in mapping and variant calling for the East Asian population owing to the distinct structural variants of the two references. Comparison of SNP calling for a large cohort of 8869 Chinese genomes using CN1 and CHM13 as reference, respectively, showed that the reference bias profoundly impacts rare SNP calling, with nearly 2 million rare SNPs miscalled with different reference genomes. Finally, applying CN1 as a reference, we discovered 5.80 Mb and 4.21 Mb of putative introgression sequences from Neanderthal and Denisovan, respectively, including many East Asian-specific ones undetected using CHM13 as the reference. Our analyses reveal the advantages of using CN1 as a reference for population genomic studies and paleo-genomic studies. This complete genome will serve as an alternative reference for future genomic studies on the East Asian population.
Affiliation(s)
- Chentao Yang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yang Zhou
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI Research-Wuhan, BGI, Wuhan, Hubei, China
| | - Yanni Song
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Dongya Wu
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Institute of Crop Science & Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
| | - Yan Zeng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Lei Nie
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Guangji Chen
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Jinjin Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Hongling Zhou
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Long Zhou
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xiaobo Qian
- BGI-Shenzhen, Shenzhen, Guangdong, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Chenlu Liu
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | | | | | - Wei Dai
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Mengyang Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yanwei Qi
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Xiaobo Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
| | - Lidong Guo
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Aijun Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong, China
| | - Yuan Deng
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Yong Zhang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Yunqiu He
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Chunxue Guo
- BGI-Shenzhen, Shenzhen, Guangdong, China
- BGI-Hangzhou, Hangzhou, Zhejiang, China
| | - Guoji Guo
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Qing Zhou
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China
- Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | | | - Jian Wang
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Shuhua Xu
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
- Human Phenome Institute, Zhangjiang Fudan International Innovation Center, and Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, International Joint Center of Genomics of Jiangsu Province School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Department of Liver Surgery and Transplantation Liver Cancer Institute, Zhongshan Hospital, Fudan University, Shanghai, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, Yunnan, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xin Jin
- BGI-Shenzhen, Shenzhen, Guangdong, China
| | - Jue Ruan
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China.
| | - Guojie Zhang
- Center for Genomic Research, International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, China.
- Center for Evolutionary & Organismal Biology, & Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, China.
- Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, Zhejiang, China.
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China.
| |
|
36
|
Kolesnikov A, Cook D, Nattestad M, McNulty B, Gorzynski J, Goenka S, Ashley EA, Jain M, Miga KH, Paten B, Chang PC, Carroll A, Shafin K. Local read haplotagging enables accurate long-read small variant calling. bioRxiv 2023:2023.09.07.556731. [PMID: 37745389 PMCID: PMC10515762 DOI: 10.1101/2023.09.07.556731] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 09/26/2023]
Abstract
Long-read sequencing technology has enabled variant detection in difficult-to-map regions of the genome and rapid genetic diagnosis in clinical settings. Rapidly evolving third-generation sequencing companies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are introducing newer platforms and data types. It has been demonstrated that variant calling methods based on deep neural networks can use local haplotyping information with long reads to improve genotyping accuracy. However, using local haplotype information creates an overhead, as variant calling needs to be performed multiple times, which ultimately makes it difficult to extend to new data types and platforms as they are introduced. In this work, we have developed a local haplotype approximation method that enables state-of-the-art variant calling performance with multiple sequencing platforms, including the PacBio Revio system and ONT R10.4 simplex and duplex data. This addition of local haplotype approximation makes DeepVariant a universal variant calling solution for long-read sequencing platforms.
Affiliation(s)
| | - Daniel Cook
- Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA
| | | | - Brandy McNulty
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
| | | | | | | | - Miten Jain
- Northeastern University, Boston, MA, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA
| | | | | | | |
|
37
|
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O'Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature 2023; 621:344-354. [PMID: 37612512 PMCID: PMC10752217 DOI: 10.1038/s41586-023-06457-y] [Citation(s) in RCA: 155] [Impact Index Per Article: 77.5] [Received: 12/02/2022] [Accepted: 07/19/2023] [Indexed: 08/25/2023]
Abstract
The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications (1-3). As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished (4,5). Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome (4) and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.
Affiliation(s)
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Oxford Nanopore Technologies Inc., Oxford, UK
| | - Monika Cechova
- Faculty of Informatics, Masaryk University, Brno, Czech Republic
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Savannah J Hoyt
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Nicolas Altemose
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
| | - Paul W Hook
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Mikko Rautiainen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Ivan A Alexandrov
- Federal Research Center of Biotechnology of the Russian Academy of Sciences, Moscow, Russia
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- Department of Anatomy and Anthropology and Department of Human Molecular Genetics and Biochemistry, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv-Yafo, Israel
| | - Jamie Allen
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Mobin Asri
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Chen-Shan Chin
- GeneDX Holdings Corp, Stamford, CT, USA
- Foundation of Biological Data Science, Belmont, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
| | | | | | - Carlos Garcia Giron
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Ariel Gershman
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer L Gerton
- Stowers Institute for Medical Research, Kansas City, MO, USA
- University of Kansas Medical Center, Kansas City, MO, USA
| | - Patrick G S Grady
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Leanne Haggerty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Reza Halabian
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Nancy F Hansen
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Robert Harris
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Gabrielle A Hartley
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Marina Haukness
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Jakob Heinz
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | | | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore, MD, USA
| | - Miten Jain
- Department of Bioengineering, Department of Physics, Northeastern University, Boston, MA, USA
| | - Rupesh K Kesharwani
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Alexandra P Lewis
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Julian K Lucas
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Wojciech Makalowski
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, Münster, Germany
| | - Christopher Markovic
- Genome Technology Access Center at the McDonnell Genome Institute, Washington University, St. Louis, MO, USA
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Ann M Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer McDaniel
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Brandy M McNulty
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, PA, USA
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, St Petersburg, Russia
- UCL Queen Square Institute of Neurology, UCL, London, UK
| | - Katherine M Munson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Terence D Murphy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Hugh E Olsen
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Nathan D Olson
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Luis F Paulin
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - David Porubsky
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Tamara Potapova
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Fedor Ryabov
- Masters Program in National Research University Higher School of Economics, Moscow, Russia
| | - Steven L Salzberg
- Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
| | | | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
- Department of Computer Science, Rice University, Houston, TX, USA
| | | | | | - Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | | | - Likhitha Surapaneni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Angela M Taravella Oill
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Winston Timp
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Marta Tomaszkiewicz
- Department of Biology, Pennsylvania State University, University Park, PA, USA
- Department of Biomedical Engineering, Pennsylvania State University, State College, PA, USA
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Allison C Watwood
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | | | | | - Melissa A Wilson
- Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Samantha Zarate
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Yiming Zhu
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Investigator, Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Rachel J O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT, USA
| | - Michael C Schatz
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Kateryna D Makova
- Department of Biology, Pennsylvania State University, University Park, PA, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
|
38
|
Abstract
The p-arms of the five human acrocentric chromosomes bear nucleolar organizer regions (NORs) comprising ribosomal gene (rDNA) repeats that are organized in a homogeneous tandem array and transcribed in a telomere-to-centromere direction. Precursor ribosomal RNA transcripts are processed and assembled into ribosomal subunits, the nucleolus being the physical manifestation of this process. I review current understanding of nucleolar chromosome biology and describe current exploration into a role for the NOR chromosomal context. Full DNA sequences for acrocentric p-arms are now emerging, aided by the current revolution in long-read sequencing and genome assembly. Acrocentric p-arms vary from 10.1 to 16.7 Mb, accounting for ∼2.2% of the genome. Bordering rDNA arrays, distal junctions, and proximal junctions are shared among the p-arms, with distal junctions showing evidence of functionality. The remaining p-arm sequences comprise multiple satellite DNA classes and segmental duplications that facilitate recombination between heterologous chromosomes, which is likely also involved in Robertsonian translocations.
Affiliation(s)
- Brian McStay
- Centre for Chromosome Biology, College of Science and Engineering, University of Galway, Galway, Ireland
|
39
|
Yang X, Wang X, Zou Y, Zhang S, Xia M, Fu L, Vollger MR, Chen NC, Taylor DJ, Harvey WT, Logsdon GA, Meng D, Shi J, McCoy RC, Schatz MC, Li W, Eichler EE, Lu Q, Mao Y. Characterization of large-scale genomic differences in the first complete human genome. Genome Biol 2023; 24:157. [PMID: 37403156 PMCID: PMC10320979 DOI: 10.1186/s13059-023-02995-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/14/2022] [Accepted: 06/23/2023] [Indexed: 07/06/2023] Open
Abstract
BACKGROUND: The first telomere-to-telomere (T2T) human genome assembly (T2T-CHM13) release is a milestone in human genomics. The T2T-CHM13 genome assembly extends our understanding of telomeres, centromeres, segmental duplication, and other complex regions. The current human genome reference (GRCh38) has been widely used in various human genomic studies. However, the large-scale genomic differences between these two important genome assemblies have not yet been characterized in detail. RESULTS: Here, in addition to the previously reported "non-syntenic" regions, we find 67 additional large-scale discrepant regions and precisely categorize them into four structural types with a newly developed website tool called SynPlotter. The discrepant regions (~21.6 Mbp) excluding telomeric and centromeric regions are highly structurally polymorphic in humans, where the deletions or duplications are likely associated with various human diseases, such as immune and neurodevelopmental disorders. The analyses of a newly identified discrepant region, the KLRC gene cluster, show that the depletion of KLRC2 by a single-deletion event is associated with natural killer cell differentiation in ~20% of humans. Meanwhile, the rapid amino acid replacements observed within KLRC3 are probably a result of natural selection in primate evolution. CONCLUSION: Our study provides a foundation for understanding the large-scale structural genomic differences between the two crucial human reference genomes, and is thereby important for future human genomics studies.
Affiliation(s)
- Xiangyu Yang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Xuankai Wang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yawen Zou
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Shilong Zhang
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Manying Xia
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Lianting Fu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Mitchell R Vollger
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - William T Harvey
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Dan Meng
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Junfeng Shi
- Shanghai Engineering Research Center of Advanced Dental Technology and Materials, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Department of Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Weidong Li
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Qing Lu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
| | - Yafei Mao
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Key Laboratory of Stomatology, Shanghai Ninth People's Hospital, College of Stomatology, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
|
40
|
Laufer VA, Glover TW, Wilson TE. Applications of advanced technologies for detecting genomic structural variation. Mutation Research. Reviews in Mutation Research 2023; 792:108475. [PMID: 37931775 PMCID: PMC10792551 DOI: 10.1016/j.mrrev.2023.108475] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/26/2023] [Revised: 09/07/2023] [Accepted: 11/02/2023] [Indexed: 11/08/2023]
Abstract
Chromosomal structural variation (SV) encompasses a heterogeneous class of genetic variants that exerts strong influences on human health and disease. Despite their importance, many structural variants (SVs) have remained poorly characterized at even a basic level, a discrepancy predicated upon the technical limitations of prior genomic assays. However, recent advances in genomic technology can identify and localize SVs accurately, opening new questions regarding SV risk factors and their impacts in humans. Here, we first define and classify human SVs and their generative mechanisms, highlighting characteristics leveraged by various SV assays. We next examine the first-ever gapless assembly of the human genome and the technical process of assembling it, which required third-generation sequencing technologies to resolve structurally complex loci. The new portions of that "telomere-to-telomere" and subsequent pangenome assemblies highlight aspects of SV biology likely to develop in the near term. We consider the strengths and limitations of the most promising new SV technologies and when they or longstanding approaches are best suited to meeting salient goals in the study of human SV in population-scale genomics research, clinical, and public health contexts. It is a watershed time in our understanding of human SV when new approaches are expected to fundamentally change genomic applications.
Affiliation(s)
- Vincent A Laufer
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
| | - Thomas W Glover
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
| | - Thomas E Wilson
- Department of Pathology, University of Michigan Medical School, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI 48109, USA.
|
41
|
Pazhenkova EA, Lukhtanov VA. Chromosomal conservatism vs chromosomal megaevolution: enigma of karyotypic evolution in Lepidoptera. Chromosome Res 2023; 31:16. [PMID: 37300756 DOI: 10.1007/s10577-023-09725-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 05/21/2023] [Accepted: 05/23/2023] [Indexed: 06/12/2023]
Abstract
In the evolution of many organisms, periods of slow genome reorganization (= chromosomal conservatism) are interrupted by bursts of numerous chromosomal changes (= chromosomal megaevolution). Using comparative analysis of chromosome-level genome assemblies, we investigated these processes in blue butterflies (Lycaenidae). We demonstrate that the phase of chromosome number conservatism is characterized by the stability of most autosomes and dynamic evolution of the sex chromosome Z, resulting in multiple variants of NeoZ chromosomes due to autosome-sex chromosome fusions. In contrast, during the phase of rapid chromosomal evolution, the explosive increase in chromosome number occurs mainly through simple chromosomal fissions. We show that chromosomal megaevolution is a highly non-random canalized process, and in two phylogenetically independent Lysandra lineages, the drastic parallel increase in number of fragmented chromosomes was achieved, at least partially, through reuse of the same ancestral chromosomal breakpoints. In species showing chromosome number doubling, we found no blocks of duplicated sequences or duplicated chromosomes, thus refuting the hypothesis of polyploidy. In the studied taxa, long blocks of interstitial telomere sequences (ITSs) consist of (TTAGG)n arrays interspersed with telomere-specific retrotransposons. ITSs are sporadically present in rapidly evolving Lysandra karyotypes, but not in the species with ancestral chromosome number. Therefore, we hypothesize that the transposition of telomeric sequences may trigger the rapid chromosome number increase. Finally, we discuss the hypothetical genomic and population mechanisms of chromosomal megaevolution and argue that the disproportionally high evolutionary role of the Z sex chromosome can be additionally reinforced by sex chromosome-autosome fusions and Z-chromosome inversions.
Affiliation(s)
- Elena A Pazhenkova
- Department of Biology, Biotechnical Faculty, University of Ljubljana, Večna Pot 111, 1000, Ljubljana, Slovenia.
| | - Vladimir A Lukhtanov
- Department of Karyosystematics, Zoological Institute of Russian Academy of Sciences, Universitetskaya Nab. 1, 199034, St. Petersburg, Russia.
|
42
|
Wlodzimierz P, Rabanal FA, Burns R, Naish M, Primetis E, Scott A, Mandáková T, Gorringe N, Tock AJ, Holland D, Fritschi K, Habring A, Lanz C, Patel C, Schlegel T, Collenberg M, Mielke M, Nordborg M, Roux F, Shirsekar G, Alonso-Blanco C, Lysak MA, Novikova PY, Bousios A, Weigel D, Henderson IR. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature 2023:10.1038/s41586-023-06062-z. [PMID: 37198485 DOI: 10.1038/s41586-023-06062-z] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 11/17/2022] [Accepted: 04/06/2023] [Indexed: 05/19/2023]
Abstract
Centromeres are critical for cell division, loading CENH3 or CENPA histone variant nucleosomes, directing kinetochore formation and allowing chromosome segregation (1,2). Despite their conserved function, centromere size and structure are diverse across species. To understand this centromere paradox (3,4), it is necessary to know how centromeric diversity is generated and whether it reflects ancient trans-species variation or, instead, rapid post-speciation divergence. To address these questions, we assembled 346 centromeres from 66 Arabidopsis thaliana and 2 Arabidopsis lyrata accessions, which exhibited a remarkable degree of intra- and inter-species diversity. A. thaliana centromere repeat arrays are embedded in linkage blocks, despite ongoing internal satellite turnover, consistent with roles for unidirectional gene conversion or unequal crossover between sister chromatids in sequence diversification. Additionally, centrophilic ATHILA transposons have recently invaded the satellite arrays. To counter ATHILA invasion, chromosome-specific bursts of satellite homogenization generate higher-order repeats and purge transposons, in line with cycles of repeat evolution. Centromeric sequence changes are even more extreme in comparison between A. thaliana and A. lyrata. Together, our findings identify rapid cycles of transposon invasion and purging through satellite homogenization, which drive centromere evolution and ultimately contribute to speciation.
Affiliation(s)
- Piotr Wlodzimierz
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Fernando A Rabanal
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Robin Burns
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Matthew Naish
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Elias Primetis
- School of Life Sciences, University of Sussex, Brighton, UK
| | - Alison Scott
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Cologne, Germany
| | - Terezie Mandáková
- Central European Institute of Technology, Masaryk University, Brno, Czech Republic
| | - Nicola Gorringe
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Andrew J Tock
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Daniel Holland
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Katrin Fritschi
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Anette Habring
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Christa Lanz
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Christie Patel
- Department of Plant Sciences, University of Cambridge, Cambridge, UK
| | - Theresa Schlegel
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Maximilian Collenberg
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Miriam Mielke
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Magnus Nordborg
- Gregor Mendel Institute, Vienna, Austrian Academy of Sciences, Vienna BioCenter, Vienna, Austria
| | - Fabrice Roux
- LIPME, INRAE, CNRS, Université de Toulouse, Castanet-Tolosan, France
| | - Gautam Shirsekar
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| | - Carlos Alonso-Blanco
- Departamento de Genética Molecular de Plantas, Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas, Madrid, Spain
| | - Martin A Lysak
- Central European Institute of Technology, Masaryk University, Brno, Czech Republic
| | - Polina Y Novikova
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Cologne, Germany
| | | | - Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany.
| | - Ian R Henderson
- Department of Plant Sciences, University of Cambridge, Cambridge, UK.
|
43
|
Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 2023:10.1038/s41576-023-00590-0. [PMID: 37059810 DOI: 10.1038/s41576-023-00590-0] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Accepted: 02/22/2023] [Indexed: 04/16/2023]
Abstract
Genetic variant calling from DNA sequencing has enabled understanding of germline variation in hundreds of thousands of humans. Sequencing technologies and variant-calling methods have advanced rapidly, routinely providing reliable variant calls in most of the human genome. We describe how advances in long reads, deep learning, de novo assembly and pangenomes have expanded access to variant calls in increasingly challenging, repetitive genomic regions, including medically relevant regions, and how new benchmark sets and benchmarking methods illuminate their strengths and limitations. Finally, we explore the possible future of more complete characterization of human genome variation in light of the recent completion of a telomere-to-telomere human genome reference assembly and human pangenomes, and we consider the innovations needed to benchmark their newly accessible repetitive regions and complex variants.
Affiliation(s)
- Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Nathan Dwarshuis
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Karen H Miga
- UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
| | - Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, USA
| | | | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
|
44
|
Denti L, Khorsand P, Bonizzoni P, Hormozdiari F, Chikhi R. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads. Nat Methods 2023; 20:550-558. [PMID: 36550274 DOI: 10.1038/s41592-022-01674-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/23/2022] [Accepted: 10/08/2022] [Indexed: 12/24/2022]
Abstract
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and by the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS), a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.
Affiliation(s)
- Luca Denti
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France
| | | | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy.
| | - Fereydoun Hormozdiari
- Genome Center, UC Davis, Davis, CA, USA.
- UC Davis MIND Institute, Sacramento, CA, USA.
- Department of Biochemistry and Molecular Medicine, Sacramento, UC Davis, Sacramento, CA, USA.
| | - Rayan Chikhi
- Sequence Bioinformatics, Department of Computational Biology, Institut Pasteur, Paris, France.
|
45
|
Chang J, Stahlke AR, Chudalayandi S, Rosen BD, Childers AK, Severin AJ. polishCLR: A Nextflow Workflow for Polishing PacBio CLR Genome Assemblies. Genome Biol Evol 2023; 15:7040681. [PMID: 36792366 PMCID: PMC9985148 DOI: 10.1093/gbe/evad020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2022] [Revised: 02/02/2023] [Accepted: 02/08/2023] [Indexed: 02/17/2023] Open
Abstract
Long-read sequencing has revolutionized genome assembly, yielding highly contiguous, chromosome-level contigs. However, assemblies from some third-generation long-read technologies, such as Pacific Biosciences (PacBio) continuous long reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. Although best practices for polishing non-model de novo genome assemblies were recently described by the Vertebrate Genome Project (VGP) Assembly community, there is a need for a publicly available, reproducible workflow that can be easily implemented and run on a conventional high-performance computing environment. Here, we describe polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. PolishCLR can be initiated from several input options that extend best practices to suboptimal cases. It also provides re-entry points throughout several key processes, including identifying duplicate haplotypes in purge_dups, allowing a break for scaffolding if data are available, and throughout multiple rounds of polishing and evaluation with Arrow and FreeBayes. PolishCLR is containerized and publicly available for the greater assembly community as a tool to complete assemblies from existing, error-prone long-read data.
Affiliation(s)
- Jennifer Chang
- USDA, Agricultural Research Service, Jamie Whitten Delta States Research Center, Genomics and Bioinformatics Research Unit, Stoneville, Mississippi; Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee; Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames
| | - Amanda R Stahlke
- USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Bee Research Laboratory, Beltsville, Maryland
| | | | - Benjamin D Rosen
- USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Animal Genomics and Improvement Laboratory, Beltsville, Maryland
| | - Anna K Childers
- USDA, Agricultural Research Service, Beltsville Agricultural Research Center, Bee Research Laboratory, Beltsville, Maryland
| | - Andrew J Severin
- Genome Informatics Facility, Office of Biotechnology, Iowa State University, Ames
|
46
|
Wick RR, Judd LM, Holt KE. Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLoS Comput Biol 2023; 19:e1010905. [PMID: 36862631 PMCID: PMC9980784 DOI: 10.1371/journal.pcbi.1010905] [Citation(s) in RCA: 52] [Impact Index Per Article: 26.0] [Indexed: 03/03/2023] Open
Abstract
A perfect bacterial genome assembly is one where the assembled sequence is an exact match for the organism's genome: each replicon sequence is complete and contains no errors. While this has been difficult to achieve in the past, improvements in long-read sequencing, assemblers, and polishers have brought perfect assemblies within reach. Here, we describe our recommended approach for assembling a bacterial genome to perfection using a combination of Oxford Nanopore Technologies long reads and Illumina short reads: Trycycler long-read assembly, Medaka long-read polishing, Polypolish short-read polishing, followed by other short-read polishing tools and manual curation. We also discuss potential pitfalls one might encounter when assembling challenging genomes, and we provide an online tutorial with sample data (github.com/rrwick/perfect-bacterial-genome-tutorial).
Affiliation(s)
- Ryan R. Wick
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Louise M. Judd
  - Department of Microbiology and Immunology, University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Kathryn E. Holt
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
  - Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, United Kingdom
|
47
|
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 2023; 41:232-238. [PMID: 36050551 DOI: 10.1038/s41587-022-01435-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/28/2021] [Accepted: 07/15/2022] [Indexed: 11/08/2022]
Abstract
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
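The Q20, Q30, Q40, Q43 and Q45 figures in this abstract are Phred-scaled quality values, where Q = -10 · log10(p) and p is the per-base error probability. As a reading aid, a minimal sketch of that conversion follows; the function names are illustrative and are not taken from DeepConsensus or pbccs.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Quality thresholds quoted in the abstract above.
    for q in (20, 30, 40, 43, 45):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:.2e}, roughly 1 error per {round(1 / p):,} bases")
```

On this scale each increase of 10 corresponds to a tenfold drop in error probability, so the step from Q43 to Q45 is a reduction of roughly a third in the per-base error rate.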
|
48
|
Secomandi S, Gallo GR, Sozzoni M, Iannucci A, Galati E, Abueg L, Balacco J, Caprioli M, Chow W, Ciofi C, Collins J, Fedrigo O, Ferretti L, Fungtammasan A, Haase B, Howe K, Kwak W, Lombardo G, Masterson P, Messina G, Møller AP, Mountcastle J, Mousseau TA, Ferrer Obiol J, Olivieri A, Rhie A, Rubolini D, Saclier M, Stanyon R, Stucki D, Thibaud-Nissen F, Torrance J, Torroni A, Weber K, Ambrosini R, Bonisoli-Alquati A, Jarvis ED, Gianfranceschi L, Formenti G. A chromosome-level reference genome and pangenome for barn swallow population genomics. Cell Rep 2023; 42:111992. [PMID: 36662619 PMCID: PMC10044405 DOI: 10.1016/j.celrep.2023.111992] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/09/2022] [Revised: 07/20/2022] [Accepted: 01/04/2023] [Indexed: 01/20/2023]
Abstract
Insights into the evolution of non-model organisms are limited by the lack of reference genomes of high accuracy, completeness, and contiguity. Here, we present a chromosome-level, karyotype-validated reference genome and pangenome for the barn swallow (Hirundo rustica). We complement these resources with a reference-free multialignment of the reference genome with other bird genomes and with the most comprehensive catalog of genetic markers for the barn swallow. We identify potentially conserved and accelerated genes using the multialignment and estimate genome-wide linkage disequilibrium using the catalog. We use the pangenome to infer core and accessory genes and to detect variants using it as a reference. Overall, these resources will foster population genomics studies in the barn swallow, enable detection of candidate genes in comparative genomics studies, and help reduce bias toward a single reference genome.
Affiliation(s)
- Simona Secomandi
  - Department of Biosciences, University of Milan, Milan, Italy; Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
- Guido R Gallo
  - Department of Biosciences, University of Milan, Milan, Italy
- Alessio Iannucci
  - Department of Biology, University of Florence, Sesto Fiorentino (FI), Italy
- Elena Galati
  - Department of Biosciences, University of Milan, Milan, Italy
- Linelle Abueg
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Jennifer Balacco
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Manuela Caprioli
  - Department of Environmental Sciences and Policy, University of Milan, Milan, Italy
- Claudio Ciofi
  - Department of Biology, University of Florence, Sesto Fiorentino (FI), Italy
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Luca Ferretti
  - Department of Biology and Biotechnology "L. Spallanzani", University of Pavia, Pavia, Italy
- Bettina Haase
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Woori Kwak
  - Department of Medical and Biological Sciences, The Catholic University of Korea, Bucheon 14662, Korea
- Gianluca Lombardo
  - Department of Biology and Biotechnology "L. Spallanzani", University of Pavia, Pavia, Italy
- Patrick Masterson
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Anders P Møller
  - Ecologie Systématique Evolution, Université Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, Orsay Cedex, France
- Timothy A Mousseau
  - Department of Biological Sciences, University of South Carolina, Columbia, SC 29208, USA
- Joan Ferrer Obiol
  - Department of Environmental Sciences and Policy, University of Milan, Milan, Italy
- Anna Olivieri
  - Department of Biology and Biotechnology "L. Spallanzani", University of Pavia, Pavia, Italy
- Arang Rhie
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Diego Rubolini
  - Department of Environmental Sciences and Policy, University of Milan, Milan, Italy
- Roscoe Stanyon
  - Department of Biology, University of Florence, Sesto Fiorentino (FI), Italy
- Françoise Thibaud-Nissen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Antonio Torroni
  - Department of Biology and Biotechnology "L. Spallanzani", University of Pavia, Pavia, Italy
- Roberto Ambrosini
  - Department of Environmental Sciences and Policy, University of Milan, Milan, Italy
- Andrea Bonisoli-Alquati
  - Department of Biological Sciences, California State Polytechnic University - Pomona, Pomona, CA, USA
- Erich D Jarvis
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA; The Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
|
49
|
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2023; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023]
Abstract
BACKGROUND: Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth of coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances, namely local, medium, or distant associations.
FINDINGS: This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to provide, for the first time, a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar.
CONCLUSIONS: The AlcoR method enables fast sequence characterization through data complexity analysis, ideally for scenarios involving the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
Affiliation(s)
- Jorge M Silva
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Weihong Qi
  - Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
  - SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
- Armando J Pinho
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Diogo Pratas
  - IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
  - Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
  - Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
|
50
|
Wang H, Xu D, Wang S, Wang A, Lei L, Jiang F, Yang B, Yuan L, Chen R, Zhang Y, Fan W. Chromosome-scale Amaranthus tricolor genome provides insights into the evolution of the genus Amaranthus and the mechanism of betalain biosynthesis. DNA Res 2022; 30:6880148. [PMID: 36473054 PMCID: PMC9847342 DOI: 10.1093/dnares/dsac050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 11/25/2022] [Accepted: 12/01/2022] [Indexed: 12/12/2022]
Abstract
Amaranthus tricolor is a vegetable and ornamental amaranth, with high lysine, dietary fibre and squalene content. The red cultivar of A. tricolor possesses a high concentration of betalains, which have been used as natural food colorants. Here, we constructed the genome of A. tricolor, the first reference genome for the subgenus Albersia, combining PacBio HiFi, Nanopore ultra-long and Hi-C data. The contig N50 size was 906 kb, and 99.58% of contig sequence was anchored to the 17 chromosomes, totalling 520 Mb. We annotated 27,813 protein-coding genes with an average 1.3 kb coding sequence and 5.3 exons. We inferred that A. tricolor underwent a whole-genome duplication (WGD) and that the WGD shared by amaranths occurred in the last common ancestor of subfamily Amaranthoideae. Moreover, we comprehensively identified candidate genes in the betalain biosynthesis pathway. Among them, DODAα1 and CYP76ADα1, located in one topologically associated domain (TAD) of an active (A) compartment on chromosome 16, were more highly expressed in red leaves than in green leaves, and DODAα1 might be the rate-limiting enzyme gene in betalain biosynthesis. This study presents new genome resources and enriches our understanding of amaranth evolution and betalain production, facilitating molecular breeding improvements and the understanding of C4 plant evolution.
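The "contig N50" quoted in this abstract is a standard contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the assembled bases. A minimal sketch of the calculation follows, using made-up contig lengths rather than data from the A. tricolor assembly; the function name is illustrative.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of
    the contig at which the running total first reaches half of the
    total assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    # Hypothetical toy assembly, lengths in kb (not from the paper).
    toy_contigs = [80, 70, 50, 40, 30, 20]
    print(n50(toy_contigs))
```

NG50, which appears in the DeepConsensus abstract earlier in this list, is computed the same way except that the halfway point is taken from an estimated genome size rather than from the sum of contig lengths.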
Affiliation(s)
- Sen Wang
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Anqi Wang
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Lihong Lei
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Fan Jiang
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Boyuan Yang
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Lihua Yuan
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Rong Chen
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Yan Zhang
  - Guangdong Laboratory for Lingnan Modern Agriculture (Shenzhen Branch), Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
- Wei Fan
  - To whom correspondence should be addressed. Tel. +86 18165787021.
|