1
Espinosa E, Bautista R, Larrosa R, Plata O. Advancements in long-read genome sequencing technologies and algorithms. Genomics 2024; 116:110842. [PMID: 38608738 DOI: 10.1016/j.ygeno.2024.110842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 04/01/2024] [Accepted: 04/06/2024] [Indexed: 04/14/2024]
Abstract
The recent advent of long-read sequencing technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), has led to substantial improvements in accuracy and computational cost in sequencing genomes. However, de novo whole-genome assembly still presents significant challenges related to the quality of the results. Pursuing de novo whole-genome assembly remains a formidable challenge, underscored by intricate considerations surrounding computational demands and result quality. As sequencing accuracy and throughput steadily advance, a continuous stream of innovative assembly tools floods the field. Navigating this dynamic landscape necessitates a reasonable choice of sequencing platform, depth, and assembly tools to orchestrate high-quality genome reconstructions. This comprehensive review delves into the intricate interplay between cutting-edge long-read sequencing technologies, assembly methodologies, and the ever-evolving field of genomics. With a focus on addressing the pivotal challenges and harnessing the opportunities presented by these advancements, we provide an in-depth exploration of the crucial factors influencing the selection of optimal strategies for achieving robust and insightful genome assemblies.
Affiliation(s)
- Elena Espinosa
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain.
- Rocio Bautista
- Supercomputing and Bioinnovation Center, University of Malaga, C. Severo Ochoa, 34, Malaga 29590, Spain.
- Rafael Larrosa
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain; Supercomputing and Bioinnovation Center, University of Malaga, C. Severo Ochoa, 34, Malaga 29590, Spain.
- Oscar Plata
- Department of Computer Architecture, University of Malaga, Louis Pasteur, 35, Campus de Teatinos, Malaga 29071, Spain.
2
Feng Z, Zheng Y, Jiang Y, Pei J, Huang L. Phylogenetic relationships, selective pressure and molecular markers development of six species in subfamily Polygonoideae based on complete chloroplast genomes. Sci Rep 2024; 14:9783. [PMID: 38684694 PMCID: PMC11059183 DOI: 10.1038/s41598-024-58934-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 04/04/2024] [Indexed: 05/02/2024]
Abstract
The subfamily Polygonoideae encompasses a diverse array of medicinal and horticultural plants that hold significant economic value. However, due to the lack of a robust taxonomy based on phylogenetic relationships, the classification within this family is perplexing, and there is also a scarcity of reports on the chloroplast genomes of many plants falling under this classification. In this study, we conducted a comprehensive analysis by sequencing and characterizing the complete chloroplast genomes of six Polygonoideae plants, namely Pteroxygonum denticulatum, Pleuropterus multiflorus, Pleuropterus ciliinervis, Fallopia aubertii, Fallopia dentatoalata, and Fallopia convolvulus. Our findings revealed that these six plants possess chloroplast genomes with a typical quadripartite structure, averaging 162,931 bp in length. Comparative chloroplast analysis, codon usage analysis, and repetitive sequence analysis demonstrated a high level of conservation within the chloroplast genomes of these plants. Furthermore, phylogenetic analysis unveiled a distinct clade occupied by P. denticulatum, while P. ciliinervis displayed a closer relationship to the three plants belonging to the Fallopia genus. Selective pressure analysis based on maximum likelihood trees showed that a total of 14 protein-coding genes exhibited positive selection, with psbB and ycf1 having the highest number of positive amino acid sites. Additionally, we identified four molecular markers, namely petN-psbM, psaI-ycf4, ycf3-trnS-GGA, and trnL-UAG-ccsA, which exhibit high variability and can be utilized for the identification of these six plants.
Affiliation(s)
- Zhan Feng
- Key Laboratory of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People's Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100193, China
- State Key Laboratory of Southwestern Chinese Medicine Resources, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, Sichuan, China
- Yan Zheng
- Key Laboratory of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People's Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100193, China
- Yuan Jiang
- Key Laboratory of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People's Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100193, China
- Jin Pei
- State Key Laboratory of Southwestern Chinese Medicine Resources, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, Sichuan, China.
- Linfang Huang
- Key Laboratory of Chinese Medicine Resources Conservation, State Administration of Traditional Chinese Medicine of the People's Republic of China, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100193, China.
3
Singh G, Alser M, Denolf K, Firtina C, Khodamoradi A, Cavlak MB, Corporaal H, Mutlu O. RUBICON: a framework for designing efficient deep learning-based genomic basecallers. Genome Biol 2024; 25:49. [PMID: 38365730 PMCID: PMC10870431 DOI: 10.1186/s13059-024-03181-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 02/02/2024] [Indexed: 02/18/2024]
Abstract
Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The performance of basecalling has critical implications for all later steps in genome analysis. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. We present RUBICON, a framework to develop efficient hardware-optimized basecallers. We demonstrate the effectiveness of RUBICON by developing RUBICALL, the first hardware-optimized mixed-precision basecaller that performs efficient basecalling, outperforming the state-of-the-art basecallers. We believe RUBICON offers a promising path to develop future hardware-optimized basecallers.
Affiliation(s)
- Gagandeep Singh
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Research and Advanced Development, AMD, Longmont, USA
- Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Can Firtina
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland.
- Meryem Banu Cavlak
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland
- Henk Corporaal
- Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zürich, Switzerland.
4
Zhu F, Yin ZT, Zhao QS, Sun YX, Jie YC, Smith J, Yang YZ, Burt DW, Hincke M, Zhang ZD, Yuan MD, Kaufman J, Sun CJ, Li JY, Shao LW, Yang N, Hou ZC. A chromosome-level genome assembly for the Silkie chicken resolves complete sequences for key chicken metabolic, reproductive, and immunity genes. Commun Biol 2023; 6:1233. [PMID: 38057566 PMCID: PMC10700341 DOI: 10.1038/s42003-023-05619-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/03/2022] [Accepted: 11/21/2023] [Indexed: 12/08/2023]
Abstract
A set of high-quality pan-genomes would help identify important genes that are still hidden/incomplete in bird reference genomes. In an attempt to address these issues, we have assembled a de novo chromosome-level reference genome of the Silkie (Gallus gallus domesticus), which is an important avian model for unique traits, like fibromelanosis, with unclear genetic foundation. This Silkie genome includes the complete genomic sequences of well-known, but unresolved, evolutionarily, endocrinologically, and immunologically important genes, including leptin, ovocleidin-17, and tumor-necrosis factor-α. The gap-less and manually annotated MHC (major histocompatibility complex) region possesses 38 recently identified genes, with differentially regulated genes recovered in response to pathogen challenges. We also provide whole-genome methylation and genetic variation maps, and resolve a complex genetic region that may contribute to fibromelanosis in these animals. Finally, we experimentally show leptin binding to the identified leptin receptor in chicken, confirming an active leptin ligand-receptor system. The Silkie genome assembly not only provides a rich data resource for avian genome studies, but also lays a foundation for further functional validation of resolved genes.
Affiliation(s)
- Feng Zhu
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Zhong-Tao Yin
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Qiang-Sen Zhao
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Yun-Xiao Sun
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Yu-Chen Jie
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Jacqueline Smith
- The Roslin Institute & R(D)SVS, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
- Yu-Ze Yang
- Beijing General Station of Animal Husbandry, 100101, Beijing, China
- David W Burt
- The Roslin Institute & R(D)SVS, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
- The University of Queensland, St. Lucia, QLD, 4072, Australia
- Maxwell Hincke
- Department of Cellular and Molecular Medicine, Department of Innovation in Medical Education, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, K1H 8M5, Canada
- Zi-Ding Zhang
- College of Biological Sciences, China Agricultural University, 100193, Beijing, China
- Meng-Di Yuan
- College of Biological Sciences, China Agricultural University, 100193, Beijing, China
- Jim Kaufman
- Institute for Immunology and Infection Research, University of Edinburgh, Edinburgh, EH9 3FL, UK
- Department of Pathology, University of Cambridge, Cambridge, CB2 1QP, UK
- Cong-Jiao Sun
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Jun-Ying Li
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China
- Li-Wa Shao
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China.
- Ning Yang
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China.
- Zhuo-Cheng Hou
- National Engineering Laboratory for Animal Breeding and Key Laboratory of Animal Genetics, Breeding and Reproduction, MARA; College of Animal Science and Technology, China Agricultural University, No. 2 Yuanmingyuan West Rd, 100193, Beijing, China.
- Sanya Institute of China Agricultural University, Beijing, China.
5
Lee J, Kim M, Han K, Yoon S. StringFix: an annotation-guided transcriptome assembler improves the recovery of amino acid sequences from RNA-Seq reads. Genes Genomics 2023; 45:1599-1609. [PMID: 37837515 DOI: 10.1007/s13258-023-01458-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 10/01/2023] [Indexed: 10/16/2023]
Abstract
BACKGROUND Reconstruction of amino acid sequences from an assembled transcriptome is of interest in personalized medicine, for example, to predict drug-target (or protein-protein) interaction considering an individual's genomic variations. Most of the existing transcriptome assemblers, however, seem not well suited for this purpose. METHODS In this work, we present StringFix, an annotation-guided transcriptome assembly and protein sequence reconstruction software tool that takes genome-aligned reads and the annotations associated with the reference genome as input. The tool 'fixes' the pre-annotated transcript sequence by taking small variations into account, finally to produce possible amino acid sequences that are likely to exist in the test tissue. RESULTS The results show that, using outputs from existing reference-based assemblers as the input GTF-guide, StringFix could reconstruct amino acid sequences more precisely with higher sensitivity than direct generation using the recovered transcripts from all the assemblers we tested. CONCLUSION By using StringFix with the existing reference-based assemblers, one can recover not only novel transcripts and isoforms but also the possible amino acid sequences stemming from them.
Affiliation(s)
- Joongho Lee
- Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Minsoo Kim
- Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Kyudong Han
- Center for Bio-Medical Engineering Core Facility, Dankook Univ, Cheonan, 31116, Korea
- Dept. of Microbiology, College of Science & Technology, Dankook Univ, Cheonan, 31116, Korea
- HuNbiome Co., Ltd, R&D Center, Seoul, 08503, Korea
- Seokhyun Yoon
- Dept. of Electronics and Electrical Engineering, College of Engineering, Dankook Univ, Yongin-si, 16890, Korea.
6
Yu R, Abdullah SMU, Sun Y. HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses. Brief Bioinform 2023; 24:bbad264. [PMID: 37478372 PMCID: PMC10516367 DOI: 10.1093/bib/bbad264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2023] [Revised: 06/05/2023] [Accepted: 06/29/2023] [Indexed: 07/23/2023]
Abstract
Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results of available polishing tools, there is still room to improve the error correction performance to perform more accurate genome assembly. The errors, particularly those in coding regions, can hamper analyses such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.
Affiliation(s)
- Runzhou Yu
- Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Yanni Sun
- Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
7
Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023; 5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023]
Abstract
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds with a single lookup of their hash values, called fuzzy seed matches. BLEND (i) utilizes a technique called SimHash, that can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×-83.9× (on average 19.3×), has a lower memory footprint by 0.9×-14.1× (on average 3.8×), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
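The SimHash idea mentioned in this abstract can be illustrated with a minimal sketch. This is not BLEND's implementation: the abstract does not describe how BLEND builds its sets or which hash functions it uses, so the k-mer decomposition, the 64-bit BLAKE2b item hash, and the names `kmers`, `simhash`, and `hamming` below are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq: str, k: int) -> list[str]:
    """Split a seed into its overlapping k-mers (the 'set' fed to SimHash)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def _hash64(item: str) -> int:
    """Hypothetical per-item hash: 64 bits taken from BLAKE2b."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(items: list[str], bits: int = 64) -> int:
    """Generic SimHash: each item votes +1/-1 on every bit position,
    and the sign of each tally becomes one bit of the final value."""
    counts = [0] * bits
    for item in items:
        h = _hash64(item)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, c in enumerate(counts):
        if c > 0:
            value |= 1 << i
    return value


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    seed_a = "ACGTACGTTAGCCGATAGGCTA"
    seed_b = "ACGTACGTTAGCCGATAGGCTT"  # one substitution at the end
    ha = simhash(kmers(seed_a, 8))
    hb = simhash(kmers(seed_b, 8))
    print(f"Hamming distance between the two hashes: {hamming(ha, hb)}")
```

Because every item only nudges the per-bit tallies, sets that share most of their items tend to produce hashes that differ in few bits, and sometimes in none, which is the property that a single hash lookup for similar seeds relies on. Whether two particular seeds collide exactly depends on the item hash and the parameters, so the sketch prints the distance rather than assuming a match.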
Affiliation(s)
- Jisung Park
- ETH Zurich, Zurich 8092, Switzerland
- POSTECH, Pohang 37673, Republic of Korea
- Can Alkan
- Bilkent University, Ankara 06800, Turkey
8
Comparative Genomics and Phylogenetic Analysis of the Chloroplast Genomes in Three Medicinal Salvia Species for Bioexploration. Int J Mol Sci 2022; 23:ijms232012080. [PMID: 36292964 PMCID: PMC9603726 DOI: 10.3390/ijms232012080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2022] [Revised: 09/08/2022] [Accepted: 09/26/2022] [Indexed: 11/17/2022]
Abstract
To systematically determine their phylogenetic relationships and develop molecular markers for species discrimination of Salvia bowleyana, S. splendens, and S. officinalis, we sequenced their chloroplast genomes using the Illumina Hiseq 2500 platform. The chloroplast genome lengths of S. bowleyana, S. splendens, and S. officinalis were 151,387 bp, 150,604 bp, and 151,163 bp, respectively. The six genes ndhB, rpl2, rpl23, rps7, rps12, and ycf2 were present in the IR regions. The chloroplast genomes of S. bowleyana, S. splendens, and S. officinalis contain 29 tandem repeats; 35, 29, 24 simple-sequence repeats, and 47, 49, 40 interspersed repeats, respectively. The three specific intergenic sequences (IGS) of rps16-trnQ-UUG, trnL-UAA-trnF-GAA, and trnM-CAU-atpE were found to discriminate the 23 Salvia species. A total of 91 intergenic spacer sequences were identified through genetic distance analysis. The two specific IGS regions (trnG-GCC-trnM-CAU and ycf3-trnS-GGA) had the highest K2p values identified in the three studied Salvia species. Furthermore, the phylogenetic tree showed that the 23 Salvia species formed a monophyletic group. Two pairs of genus-specific DNA barcode primers were found. The results will provide a solid foundation to understand the phylogenetic classification of the three Salvia species. Moreover, the specific intergenic regions can provide the probability to discriminate the Salvia species between the phenotype and the distinction of gene fragments.
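The K2p values reported in this abstract refer to the Kimura two-parameter distance, which corrects the observed difference between two aligned sequences separately for transitions and transversions: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch below computes it for a pair of pre-aligned sequences. The function name `k2p_distance` and the choice to skip any column containing a gap or ambiguous base are assumptions for illustration; the abstract does not say which software or gap treatment the authors used.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences.

    Columns where either sequence has a character other than A, C, G, T
    (gaps, N, and so on) are skipped; that handling is an assumption.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent
    # for the correction to be defined (argument <= 0).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    print(k2p_distance("ACGTACGTAC", "ACGTACGTAT"))  # one C<->T transition in 10 sites
```

In a marker-screening setting such as the one described, the distance would be computed per intergenic region across species pairs, and regions with the largest values are the most variable candidates.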
9
Du Q, Li J, Wang L, Chen H, Jiang M, Chen Z, Jiang C, Gao H, Wang B, Liu C. Complete chloroplast genomes of two medicinal Swertia species: the comparative evolutionary analysis of Swertia genus in the Gentianaceae family. PLANTA 2022; 256:73. [PMID: 36083348 DOI: 10.1007/s00425-022-03987-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/25/2022] [Accepted: 08/29/2022] [Indexed: 06/15/2023]
Abstract
The complete chloroplast genome of Swertia kouitchensis has been sequenced, assembled, and compared with that of S. bimaculata to determine the evolutionary relationships among species of the genus Swertia in the Gentianaceae family. Swertia kouitchensis and S. bimaculata are from the Gentianaceae family. The complete chloroplast genome of S. kouitchensis was newly assembled, annotated, and analyzed using the Illumina Hiseq 2500 platform. The chloroplast genomes of the two species encoded 133 and 134 genes, respectively, which included 88-89 protein-coding genes, 37 transfer RNA (tRNA) genes, and 8 ribosomal RNA genes. One intron was contained in each of the eight protein-coding genes and eight tRNA-coding genes, whereas two introns were found in two genes (ycf3 and clpP). The most abundant codon of the two species was for isoleucine, and the least abundant codon was for cysteine. Twenty-eight and thirty-two microsatellite repeat sequences were identified in the chloroplast genomes of S. kouitchensis and S. bimaculata, respectively. A total of 1127 repeat sequences were identified in all the 23 Swertia chloroplast genomes, and they fell into four categories. Furthermore, five divergence hotspot regions can be applied to discriminate these 23 Swertia species through genome comparison. One pair of genus-specific DNA barcode primers has been accurately identified. Therefore, the diverse regions cloned by a specific primer may become an effective and powerful molecular marker for the identification of the Swertia genus. Moreover, four genes (ccsA, ndhK, rpoC1, and rps12) were under positive selective pressure. The phylogenetic tree showed that the 23 Swertia species were clustered into a large clade including four evident subbranches, whereas the two species of S. kouitchensis and S. bimaculata were separately clustered into the diverse but correlated species group.
Affiliation(s)
- Qing Du
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China.
- College of Pharmacy, Key Laboratory of Medicinal Plant Resources of Qinghai-Tibetan Plateau in Qinghai Province, Qinghai Minzu University, No.3, Bayi Mid-road, Chengdong District, Xining City, Qinghai Province, 810007, People's Republic of China.
- Fresh Sky-Right (Beijing) International Science and Technology Co., Ltd, No.59, Banjing Road, Haidian District, Beijing, 100097, People's Republic of China.
- Jing Li
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China
- Xiangnan University, No. 889, Chenzhou dadao, Chenzhou City, Hunan Province, 423000, People's Republic of China
- Liqiang Wang
- College of Pharmacy, Heze University, No.2269, University Road, Mudan District, Heze City, Shandong Province, 274015, People's Republic of China
- Haimei Chen
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China
- Mei Jiang
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China
- School of Pharmaceutical Sciences, Qilu University of Technology (Shandong Academy of Sciences), No. 3501, University Road, Changqing District, Jinan City, Shandong Province, 250399, People's Republic of China
- Zhuoer Chen
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China
- Xiangnan University, No. 889, Chenzhou dadao, Chenzhou City, Hunan Province, 423000, People's Republic of China
- Chuanbei Jiang
- Genepioneer Biotechnologies Inc, No. 9, Weidi Road, Qixia District, Nanjing City, Jiangsu Province, 210000, People's Republic of China
- Haidong Gao
- Genepioneer Biotechnologies Inc, No. 9, Weidi Road, Qixia District, Nanjing City, Jiangsu Province, 210000, People's Republic of China
- Bin Wang
- Xiangnan University, No. 889, Chenzhou dadao, Chenzhou City, Hunan Province, 423000, People's Republic of China.
- Chang Liu
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, No. 151, Malianwa North Road, Hai Dian District, Beijing, 100193, People's Republic of China.
10
Alser M, Lindegger J, Firtina C, Almadhoun N, Mao H, Singh G, Gomez-Luna J, Mutlu O. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Comput Struct Biotechnol J 2022; 20:4579-4599. [PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/12/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 02/01/2023]
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Affiliation(s)
- Can Firtina
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Haiyu Mao
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
- Onur Mutlu
- ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland
11
Huang F, Xiao L, Gao M, Vallely EJ, Dybvig K, Atkinson TP, Waites KB, Chong Z. B-assembler: a circular bacterial genome assembler. BMC Genomics 2022; 23:361. [PMID: 35546658 PMCID: PMC9092672 DOI: 10.1186/s12864-022-08577-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Accepted: 04/21/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Accurate de novo assembly of bacterial genomes is fundamental to understanding the evolution and pathogenesis of new bacterial species. The advent and popularity of Third-Generation Sequencing (TGS) enables assembly of bacterial genomes at an unprecedented speed. However, most current TGS assemblers were specifically designed for human or other species that do not have a circular genome. Besides, the repetitive DNA fragments in many bacterial genomes plus the high error rate of long sequencing data make it still very challenging to accurately assemble their genomes even with a relatively small genome size. Therefore, there is an urgent need for the development of an optimized method to address these issues. RESULTS We developed B-assembler, which is capable of assembling bacterial genomes when there are only long reads or a combination of short and long reads. B-assembler takes advantage of the structural resolving power of long reads and the accuracy of short reads if applicable. It first selects and corrects the ultra-long reads to get an initial contig. Then, it collects the reads overlapping with the ends of the initial contig. This two-round assembling procedure along with optimized error correction enables a high-confidence and circularized genome assembly. Benchmarked on both synthetic and real sequencing data of several bacterial species, the results show that both long-read-only and hybrid-read modes can accurately assemble circular bacterial genomes free of structural errors and have fewer small errors compared to other assemblers. CONCLUSIONS B-assembler provides a better solution to bacterial genome assembly, which will facilitate downstream bacterial genome analysis.
Affiliation(s)
- Fengyuan Huang
- Informatics Institute, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
- Department of Genetics, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
| | - Li Xiao
- Department of Medicine, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
| | - Min Gao
- Informatics Institute, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
- Department of Medicine, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
| | - Ethan J Vallely
- Informatics Institute, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
| | - Kevin Dybvig
- Department of Genetics, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA
- Department of Pediatrics, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35233, Birmingham, USA
| | - T Prescott Atkinson
- Department of Pediatrics, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35233, Birmingham, USA
| | - Ken B Waites
- Department of Pathology, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35233, Birmingham, USA
| | - Zechen Chong
- Informatics Institute, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA.
- Department of Genetics, Heersink School of Medicine, the University of Alabama at Birmingham, AL, 35294, Birmingham, USA.
|
12
|
Du Q, Jiang M, Sun S, Wang L, Liu S, Jiang C, Gao H, Chen H, Li Y, Wang B, Liu C. The complete chloroplast genome sequence of Clerodendranthus spicatus, a medicinal plant for preventing and treating kidney diseases from Lamiaceae family. Mol Biol Rep 2022; 49:3073-3083. [PMID: 35059973 DOI: 10.1007/s11033-022-07135-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/28/2021] [Accepted: 01/10/2022] [Indexed: 12/25/2022]
Abstract
BACKGROUND: Clerodendranthus spicatus (Thunb.) C. Y. Wu ex H. W. Li is one of the most important medicines for the treatment of kidney diseases in the southeast regions of China. To understand the taxonomic classification of Clerodendranthus species and identify species discrimination markers, we sequenced and characterized its chloroplast genome in the current study. METHODS AND RESULTS: Total genomic DNA was isolated from dried leaves of C. spicatus and sequenced using an Illumina sequencing platform. The data were assembled and annotated by the NOVOPlasty software and CpGAVAS2 web service. The complete chloroplast genome of C. spicatus was 152,155 bp, including a large single-copy region of 83,098 bp, a small single-copy region of 17,665 bp, and a pair of inverted repeat regions of 25,696 bp. Isoleucine codons are the most abundant, accounting for 4.17% of all codons. The codons AUG, UUA, and AGA demonstrated a high degree of usage bias. Twenty-eight simple sequence repeats, thirty-six tandem repeats, and forty interspersed repeats were identified. The distribution of the specific rps19, ycf1, rpl2, trnH, and psbA genes was analyzed. Analysis of the genetic distance of the intergenic spacer regions shows that the ndhG-ndhI, accD-psaI, rps15-ycf1, rpl20-clpP, and ccsA-ndhD regions have high K2p values. Phylogenetic analysis showed that C. spicatus is closely related to two Lamiaceae species, Tectona grandis and Glechoma longituba. CONCLUSIONS: In this study, we sequenced and characterized the chloroplast genome of C. spicatus. Phylogenomic analysis has identified species closely related to C. spicatus, which represent potential candidates for the development of drugs improving renal functions.
Affiliation(s)
- Qing Du
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100193, People's Republic of China
- College of Pharmacy, Qinghai Provincial Key Laboratory of Phytochemistry of Qinghai Tibet Plateau, Qinghai Minzu University, Xining, Qinghai, 810007, People's Republic of China
- Fresh Sky-Right (Beijing) International Science and Technology Co. Ltd, Beijing, 100187, People's Republic of China
| | - Mei Jiang
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100193, People's Republic of China
- School of Pharmaceutical Sciences, Qilu University of Technology, Shandong Academy of Sciences, Jinan, Shandong, 250353, People's Republic of China
| | - Sihui Sun
- College of Pharmacy, Xiangnan University, Chenzhou, Hunan, 423000, People's Republic of China
| | - Liqiang Wang
- College of Pharmacy, Heze University, Heze, Shandong, 274015, People's Republic of China
| | - Shengyu Liu
- Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100193, People's Republic of China
| | - Chuanbei Jiang
- Genepioneer Biotechnologies Inc., Nanjing, Jiangsu, 210023, People's Republic of China
| | - Haidong Gao
- Genepioneer Biotechnologies Inc., Nanjing, Jiangsu, 210023, People's Republic of China
| | - Haimei Chen
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100193, People's Republic of China
| | - Yong Li
- College of Pharmacy, Xiangnan University, Chenzhou, Hunan, 423000, People's Republic of China
| | - Bin Wang
- College of Pharmacy, Xiangnan University, Chenzhou, Hunan, 423000, People's Republic of China.
| | - Chang Liu
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100193, People's Republic of China.
|
13
|
Lee JY, Kong M, Oh J, Lim J, Chung SH, Kim JM, Kim JS, Kim KH, Yoo JC, Kwak W. Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci Rep 2021; 11:20740. [PMID: 34671046 PMCID: PMC8528807 DOI: 10.1038/s41598-021-00178-w] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 07/06/2021] [Accepted: 10/07/2021] [Indexed: 01/22/2023]
Abstract
Assembling high-quality microbial genomes using only cost-effective Nanopore long-read systems such as Flongle is important to accelerate research on the microbial genome and the most critical point for this is the polishing process. In this study, we performed an evaluation based on BUSCO and Prokka gene prediction in terms of microbial genome assembly for eight state-of-the-art Nanopore polishing tools and combinations available. In the evaluation of individual tools, Homopolish, PEPPER, and Medaka demonstrated better results than others. In combination polishing, the second round Homopolish, and the PEPPER × medaka combination also showed better results than others. However, individual tools and combinations have specific limitations on usage and results. Depending on the target organism and the purpose of the downstream research, it is confirmed that there remain some difficulties in perfectly replacing the hybrid polishing carried out by the addition of a short-read. Nevertheless, through continuous improvement of the protein pores, related base-calling algorithms, and polishing tools based on improved error models, a high-quality microbial genome can be achieved using only Nanopore reads without the production of additional short-read data. The polishing strategy proposed in this study is expected to provide useful information for assembling the microbial genome using only Nanopore reads depending on the target microorganism and the purpose of the research.
Affiliation(s)
| | | | - Jinjoo Oh
- JCBio. Co., Ltd., Seoul, 05836, Korea
| | - JinSoo Lim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, 05836, Korea
| | - Sung Hee Chung
- Department of Laboratory Medicine, Kangdong Sacred Heart Hospital, Hallym University College of Medicine, Seoul, Korea
| | - Jung-Min Kim
- Department of Laboratory Medicine, Kangdong Sacred Heart Hospital, Hallym University College of Medicine, Seoul, Korea
| | - Jae-Seok Kim
- Department of Laboratory Medicine, Kangdong Sacred Heart Hospital, Hallym University College of Medicine, Seoul, Korea
| | | | | | - Woori Kwak
- Gencube Plus, Seoul, 08592, Korea.
- Hoonygen, Seoul, 08592, Korea.
|
14
|
Huang N, Nie F, Ni P, Gao X, Luo F, Wang J. BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer. Brief Bioinform 2021; 23:6383560. [PMID: 34619757 DOI: 10.1093/bib/bbab405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2021] [Revised: 08/13/2021] [Accepted: 09/03/2021] [Indexed: 11/13/2022]
Abstract
Long-read sequencing technology enables significant progress in de novo genome assembly. However, the high error rate and the wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure to fix errors in the draft assembly and improve the reliability of genomic analysis. However, existing methods treat all the regions of the assembly equally while there are fundamental differences between the error distributions of these regions. How to achieve very high accuracy in genome assembly is still a challenging problem. Motivated by the uneven errors in different regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into blocks with low complexity and high complexity according to statistics of aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Due to the different distributions of error rates in trivial and complex blocks, two multitask bidirectional long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In the whole-genome assemblies of NA12878 assembled by Wtdbg2 and Flye using Nanopore data, BlockPolish has a higher polishing accuracy than other state-of-the-art methods, including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels, and BlockPolish performs well in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce the errors in the PacBio assemblies. The source code of BlockPolish is freely available on GitHub (https://github.com/huangnengCSU/BlockPolish).
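The block-division step described in this abstract can be illustrated with a minimal sketch. It is not the BlockPolish implementation: the per-position `agreement` input, the `threshold` value of 0.9, and the function name `split_into_blocks` are assumptions chosen for illustration, and the abstract does not state which alignment statistics the authors actually use.

```python
from itertools import groupby


def split_into_blocks(agreement, threshold=0.9):
    """Group contig positions into 'trivial' and 'complex' blocks.

    agreement[i] is assumed to be the fraction of aligned reads that
    support the majority base at contig position i (hypothetical statistic).
    Returns a list of (label, start, end) tuples with end exclusive.
    """
    labels = ["trivial" if a >= threshold else "complex" for a in agreement]
    blocks = []
    pos = 0
    # Consecutive positions with the same label form one block.
    for label, run in groupby(labels):
        length = len(list(run))
        blocks.append((label, pos, pos + length))
        pos += length
    return blocks


example = split_into_blocks([1.0, 0.95, 0.6, 0.5, 0.97, 1.0])
```

In a workflow of this shape, the complex blocks would be the ones sent for realignment, while trivial blocks could be handled by a cheaper consensus step.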
Affiliation(s)
- Neng Huang
- School of Computer Science and Engineering, Central South University, China
| | - Fan Nie
- School of Computer Science and Engineering, Central South University, China
| | - Peng Ni
- School of Computer Science and Engineering, Central South University, China
| | - Xin Gao
- School of Computer Science, King Abdullah University of Science and Technology, Saudi Arabia
| | - Feng Luo
- School of Computing, Clemson University, USA
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, China
|
15
|
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023]
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Jeremy Rotman
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Dhrithi Deshpande
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
| | - Kodi Taraszka
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Huwenbo Shi
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Pelin Icer Baykal
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Victor Xue
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Benjamin D Singer
- Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
- Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Brunilda Balliu
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - David Koslicki
- Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
- Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
- The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Can Alkan
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
| | - Onur Mutlu
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
|
16
|
Abstract
Hybridization is an important evolutionary mechanism that can enable organisms to adapt to environmental challenges. It has previously been shown that the fungal allodiploid species Verticillium longisporum, the causal agent of verticillium stem striping in rapeseed, originated from at least three independent hybridization events between two haploid Verticillium species. To reveal the impact of genome duplication as a consequence of hybridization, we studied the genome and transcriptome dynamics upon two independent V. longisporum hybridization events, represented by the hybrid lineages “A1/D1” and “A1/D3.” We show that V. longisporum genomes are characterized by extensive chromosomal rearrangements, including between parental chromosomal sets. V. longisporum hybrids display signs of evolutionary dynamics that are typically associated with the aftermath of allodiploidization, such as haploidization and more relaxed gene evolution. The expression patterns of the two subgenomes within the two hybrid lineages are more similar than those of the shared A1 parent between the two lineages, showing that the expression patterns of the parental genomes homogenized within a lineage. However, as genes that display differential parental expression in planta do not typically display the same pattern in vitro, we conclude that subgenome-specific responses occur in both lineages. Overall, our study uncovers genomic and transcriptomic plasticity during the evolution of the filamentous fungal hybrid V. longisporum and illustrates its adaptive potential.
|
17
|
Huang N, Nie F, Ni P, Luo F, Gao X, Wang J. NeuralPolish: a novel Nanopore polishing method based on alignment matrix construction and orthogonal Bi-GRU Networks. Bioinformatics 2021; 37:3120-3127. [PMID: 33973998 DOI: 10.1093/bioinformatics/btab354] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/15/2021] [Revised: 03/29/2021] [Accepted: 05/06/2021] [Indexed: 01/28/2023]
Abstract
MOTIVATION: Oxford Nanopore sequencing, which produces long reads at low cost, has enabled many breakthroughs in genomics studies. However, the large number of errors in Nanopore genome assemblies affects the accuracy of genome analysis. Polishing is a procedure to correct the errors in a genome assembly and can improve the reliability of the downstream analysis. However, the performance of the existing polishing methods is still not satisfactory. RESULTS: We developed a novel polishing method, NeuralPolish, to correct the errors in assemblies based on alignment matrix construction and orthogonal Bi-GRU networks. In this method, we designed an alignment feature matrix for representing read-to-assembly alignment. Each row of the matrix represents a read, and each column represents the aligned bases at each position of the contig. In the network architecture, a bi-directional GRU network is used to extract the sequence information inside each read by processing the alignment matrix row by row. After that, the feature matrix is processed by another bi-directional GRU network column by column to calculate the probability distribution. Finally, a CTC decoder generates a polished sequence with a greedy algorithm. We used five real data sets and three assembly tools, including Wtdbg2, Flye and Canu, for testing, and compared the results of different polishing methods including NeuralPolish, Racon, MarginPolish, HELEN and Medaka. Comprehensive experiments demonstrate that NeuralPolish achieves more accurate assembly with fewer errors than other polishing methods and can improve the accuracy of assemblies obtained by different assemblers. AVAILABILITY: https://github.com/huangnengCSU/NeuralPolish.git.
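The read-by-position alignment matrix described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the NeuralPolish code: reads are assumed to arrive already aligned as `(start, aligned_bases)` pairs with `-` marking a deletion, insertions relative to the contig are ignored, and a simple column-wise majority vote stands in for the Bi-GRU networks and CTC decoder. The names `build_alignment_matrix` and `majority_consensus` are hypothetical.

```python
from collections import Counter

GAP = "-"    # read has a deletion at this contig position
EMPTY = "."  # read does not cover this contig position


def build_alignment_matrix(contig_len, alignments):
    """Build a matrix with one row per read and one column per contig position."""
    matrix = []
    for start, aligned in alignments:
        row = [EMPTY] * contig_len
        for offset, base in enumerate(aligned):
            pos = start + offset
            if 0 <= pos < contig_len:
                row[pos] = base
        matrix.append(row)
    return matrix


def majority_consensus(matrix, draft):
    """Column-wise majority vote; a stand-in for the learned consensus model."""
    polished = []
    for col, draft_base in enumerate(draft):
        votes = Counter(row[col] for row in matrix if row[col] != EMPTY)
        # Fall back to the draft base where no read covers the column.
        base = votes.most_common(1)[0][0] if votes else draft_base
        if base != GAP:
            polished.append(base)
    return "".join(polished)


draft = "ACGTTA"
reads = [(0, "ACGATA"), (0, "ACGAT"), (2, "GATA")]
matrix = build_alignment_matrix(len(draft), reads)
polished = majority_consensus(matrix, draft)
```

Here every read disagrees with the draft at position 3, so the vote replaces the draft `T` with `A`. A learned model would instead weigh context along both the row and column directions rather than counting votes per column.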
Affiliation(s)
- Neng Huang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
| | - Fan Nie
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
| | - Peng Ni
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
| | - Feng Luo
- School of Computing, Clemson University, South Carolina, 29634, USA
| | - Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha, 410083, China
|
18
|
Aury JM, Istace B. Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads. NAR Genom Bioinform 2021; 3:lqab034. [PMID: 33987534 PMCID: PMC8092372 DOI: 10.1093/nargab/lqab034] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 12/21/2020] [Revised: 03/18/2021] [Accepted: 04/13/2021] [Indexed: 12/11/2022]
Abstract
Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein, we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
Affiliation(s)
- Jean-Marc Aury
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France
| | - Benjamin Istace
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France
|
19
|
Mamede R, Vila-Cerqueira P, Silva M, Carriço JA, Ramirez M. Chewie Nomenclature Server (chewie-NS): a deployable nomenclature server for easy sharing of core and whole genome MLST schemas. Nucleic Acids Res 2021; 49:D660-D666. [PMID: 33068420 PMCID: PMC7778912 DOI: 10.1093/nar/gkaa889] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/07/2020] [Revised: 09/18/2020] [Accepted: 10/02/2020] [Indexed: 02/04/2023]
Abstract
Chewie Nomenclature Server (chewie-NS, https://chewbbaca.online/) allows users to share genome-based gene-by-gene typing schemas and to maintain a common nomenclature, simplifying the comparison of results. The combination between local analyses and a public repository of allelic data strikes a balance between potential confidentiality issues and the need to compare results. The possibility of deploying private instances of chewie-NS facilitates the creation of nomenclature servers with a restricted user base to allow compliance with the strictest data policies. Chewie-NS allows users to easily share their own schemas and to explore publicly available schemas, including informative statistics on schemas and loci presented in interactive charts and tables. Users can retrieve all the information necessary to run a schema locally or all the alleles identified at a particular locus. The integration with the chewBBACA suite enables users to directly upload new schemas to chewie-NS, download existing schemas and synchronize local and remote schemas from chewBBACA command line version, allowing an easier integration into high-throughput analysis pipelines. The same REST API linking chewie-NS and the chewBBACA suite supports the interaction of other interfaces or pipelines with the databases available at chewie-NS, facilitating the reusability of the stored data.
Affiliation(s)
- Rafael Mamede
- Instituto de Microbiologia and Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal
| | - Pedro Vila-Cerqueira
- Instituto de Microbiologia and Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal
| | - Mickael Silva
- Instituto de Microbiologia and Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal
| | - João A Carriço
- Instituto de Microbiologia and Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal
| | - Mário Ramirez
- Instituto de Microbiologia and Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal
|
20
|
Campoy JA, Sun H, Goel M, Jiao WB, Folz-Donahue K, Wang N, Rubio M, Liu C, Kukat C, Ruiz D, Huettel B, Schneeberger K. Gamete binning: chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes. Genome Biol 2020; 21:306. [PMID: 33372615 DOI: 10.1101/2020.04.24.060046] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/15/2020] [Accepted: 12/11/2020] [Indexed: 05/26/2023]
|
21
|
Campoy JA, Sun H, Goel M, Jiao WB, Folz-Donahue K, Wang N, Rubio M, Liu C, Kukat C, Ruiz D, Huettel B, Schneeberger K. Gamete binning: chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes. Genome Biol 2020; 21:306. [PMID: 33372615 PMCID: PMC7771071 DOI: 10.1186/s13059-020-02235-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 05/15/2020] [Accepted: 12/11/2020] [Indexed: 12/30/2022]
Abstract
Generating chromosome-level, haplotype-resolved assemblies of heterozygous genomes remains challenging. To address this, we developed gamete binning, a method based on single-cell sequencing of haploid gametes enabling separation of the whole-genome sequencing reads into haplotype-specific reads sets. After assembling the reads of each haplotype, the contigs are scaffolded to chromosome level using a genetic map derived from the gametes. We assemble the two genomes of a diploid apricot tree based on whole-genome sequencing of 445 individual pollen grains. The two haplotype assemblies (N50: 25.5 and 25.8 Mb) feature a haplotyping precision of greater than 99% and are accurately scaffolded to chromosome-level.
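The N50 values quoted in this abstract follow the standard contiguity definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch of that calculation is given below; the function name `n50` and the toy contig lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

For the toy input the total length is 54, and the three longest contigs (10, 9, 8) sum to 27, which reaches half of the total, so the N50 is 8.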
Affiliation(s)
- José A Campoy
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829, Cologne, Germany
| | - Hequan Sun
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829, Cologne, Germany
- Faculty of Biology, LMU Munich, Großhaderner Str. 2, 82152, Planegg-Martinsried, Germany
| | - Manish Goel
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829, Cologne, Germany
| | - Wen-Biao Jiao
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829, Cologne, Germany
| | - Kat Folz-Donahue
- FACS & Imaging Core Facility, Max Planck Institute for Biology of Ageing, 50931, Cologne, Germany
| | - Nan Wang
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
| | - Manuel Rubio
- Department of Plant Breeding, CEBAS-CSIC, PO Box 164, E-30100 Espinardo, Murcia, Spain
| | - Chang Liu
- Center for Plant Molecular Biology (ZMBP), University of Tübingen, Auf der Morgenstelle 32, 72076, Tübingen, Germany
- Institute of Biology, University of Hohenheim, Garbenstraße 30, 70599, Stuttgart, Germany
| | - Christian Kukat
- FACS & Imaging Core Facility, Max Planck Institute for Biology of Ageing, 50931, Cologne, Germany
| | - David Ruiz
- Department of Plant Breeding, CEBAS-CSIC, PO Box 164, E-30100 Espinardo, Murcia, Spain
| | - Bruno Huettel
- Max Planck-Genome-center Cologne, Carl-von-Linné-Weg 10, 50829, Cologne, Germany
| | - Korbinian Schneeberger
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829, Cologne, Germany.
- Faculty of Biology, LMU Munich, Großhaderner Str. 2, 82152, Planegg-Martinsried, Germany.
|
22
|
Alser M, Shahroodi T, Gómez-Luna J, Alkan C, Mutlu O. SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs. Bioinformatics 2020; 36:5282-5290. [PMID: 33315064 DOI: 10.1093/bioinformatics/btaa1015] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/10/2020] [Revised: 09/30/2020] [Accepted: 11/24/2020] [Indexed: 11/14/2022]
Abstract
Motivation
We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs.
Results
SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.
Availability and implementation
https://github.com/CMU-SAFARI/SneakySnake.
Supplementary information
Supplementary data are available at Bioinformatics online.
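Illustrative sketch
The filtering idea in the Motivation paragraph can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation (see the repository linked above for that). It assumes a grid with one row per diagonal shift between -E and +E, where a cell is an obstacle when the read base and the shifted reference base differ. The function name `sneaky_snake_filter` and its parameters are hypothetical. The sketch greedily follows the longest obstacle-free run, counts the obstacles it must cross, and rejects the pair once that count exceeds the edit threshold.

```python
def sneaky_snake_filter(ref: str, read: str, max_edits: int) -> bool:
    """Return True if alignment is still worth running, False to reject.

    Simplified greedy sketch (an assumption-laden illustration): the number
    of obstacles crossed is a lower bound on the edit distance, so a pair
    whose true distance is <= max_edits is never rejected.
    """
    n = len(read)
    # One grid row per diagonal shift in [-max_edits, +max_edits].
    shifts = range(-max_edits, max_edits + 1)

    def free_run(col: int, shift: int) -> int:
        # Count consecutive matching cells on this diagonal starting at col.
        length = 0
        while col + length < n:
            k = col + length + shift
            if k < 0 or k >= len(ref) or read[col + length] != ref[k]:
                break
            length += 1
        return length

    col = 0
    obstacles = 0
    while col < n:
        # Travel along the row that offers the longest obstacle-free path.
        col += max(free_run(col, s) for s in shifts)
        if col >= n:
            break
        # Blocked on every row: cross one obstacle (one estimated edit).
        obstacles += 1
        if obstacles > max_edits:
            return False
        col += 1
    return True


if __name__ == "__main__":
    print(sneaky_snake_filter("ACGTACGT", "ACGTTCGT", 1))
    print(sneaky_snake_filter("AAAAAAAA", "CCCCCCCC", 2))
```

A pair that passes the filter still goes to the aligner of choice (e.g. Edlib, Parasail or KSW2), which is why the filter does not restrict the aligner's scoring options.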
Affiliation(s)
- Mohammed Alser
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Taha Shahroodi
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
| | - Juan Gómez-Luna
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
| | - Can Alkan
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
| | - Onur Mutlu
- Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland
- Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
- Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|