1. Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
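As a minimal sketch of the two ideas named in this abstract, the snippet below counts overlapping k-mers in a sequence and lists the k-mers that never occur (the "absent sequences" from which nullomers are drawn). The function names and the toy sequence are illustrative assumptions, not taken from the review, and exhaustive enumeration like this is only practical for small k.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def absent_kmers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, all k-mers over the alphabet that never occur."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in observed)


# Hypothetical toy sequence, used only for illustration.
toy_sequence = "ACGTACGTGA"
counts = count_kmers(toy_sequence, 2)
missing = absent_kmers(toy_sequence, 2)

print(counts.most_common(3))
print(len(missing), "of", 4 ** 2, "possible 2-mers are absent")
```

For realistic genomes and larger k, dedicated k-mer counters are used instead of an in-memory `Counter`; the sketch only shows the definition.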
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
2. Darian JC, Kundu R, Rajaby R, Sung WK. Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly. Nat Methods 2024; 21:574-583. [PMID: 38459383 DOI: 10.1038/s41592-023-02141-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 11/30/2023] [Indexed: 03/10/2024]
Abstract
Draft genomes generated from Oxford Nanopore Technologies (ONT) long reads are known to have a higher error rate. Although existing genome polishers can enhance their quality, the error rate (including mismatches, indels and switching errors between paternal and maternal haplotypes) can be significant. Here, we develop two polishers, hypo-short and hypo-hybrid to address this issue. Hypo-short utilizes Illumina short reads to polish an ONT-based draft assembly, resulting in a high-quality assembly with low error rates and switching errors. Expanding on this, hypo-hybrid incorporates ONT long reads to further refine the assembly into a diploid representation. Leveraging on hypo-hybrid, we have created a diploid genome assembly pipeline called hypo-assembler. Hypo-assembler automates the generation of highly accurate, contiguous and nearly complete diploid assemblies using ONT long reads, Illumina short reads and optionally Hi-C reads. Notably, our solution even allows for the production of telomere-to-telomere diploid genomes with additional manual steps. As a proof of concept, we successfully assembled a fully phased telomere-to-telomere diploid genome of HG00733, achieving a quality value exceeding 50.
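To put "a quality value exceeding 50" in concrete terms, the sketch below converts between a quality value and a per-base error rate. It assumes the quality value is the Phred-scaled consensus QV commonly used in assembly evaluation (QV = -10 log10 of the error probability); the abstract does not spell out the definition, and the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Under the Phred assumption, QV 50 is about one error per 100,000 bases.
rate_at_qv50 = qv_to_error_rate(50)
print(f"QV 50 -> error rate {rate_at_qv50:.1e}")
print(f"error rate 1e-3 -> QV {error_rate_to_qv(1e-3):.1f}")
```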
Affiliation(s)
- Ritu Kundu
- School of Computing, National University of Singapore, Singapore, Singapore
- Wing-Kin Sung
- School of Computing, National University of Singapore, Singapore, Singapore.
- Genome Institute of Singapore, Singapore, Singapore.
- Department of Chemical Pathology, The Chinese University of Hong Kong, Hong Kong, China.
- JC STEM Laboratory of Computational Genomics, Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China.
- Hong Kong Genome Institute, Hong Kong, China.
3. Palumbo F, Draga S, Scariolo F, Gabelli G, Sacilotto GB, Gazzola M, Barcaccia G. First genomic insights into the Mandevilla genus. Front Plant Sci 2022; 13:983879. [PMID: 36051302 PMCID: PMC9426028 DOI: 10.3389/fpls.2022.983879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 07/20/2022] [Indexed: 06/15/2023]
Abstract
Mandevilla (Apocynaceae) is a greatly appreciated genus in the world ornamental market. In this study, we attempted to address the poor genetic knowledge and the huge taxonomic gaps existing in this genus by analyzing a collection of 55 accessions. After cytometrically determining the triploid genome size (1,512.64 Mb) of a reference sample (variety "Mandevilla 2001"), the plastidial genome (cpDNA, 0.18 Mb) and a draft of the nuclear genome (nuDNA, 207 Mb) were assembled. While cpDNA was effective in reconstructing the phylogeny of the Apocynaceae family based on a DNA superbarcoding approach, the nuDNA assembly length was found to be only 41% of the haploid genome size (506 Mb, predicted based on the K-mer frequency distribution). Its annotation enabled the prediction of 37,811 amino acid sequences, of which 10,562 were full-length proteins. Among them, we identified nine proteins whose orthologs (in Catharanthus roseus) are involved in the biosynthesis of monoterpene indole alkaloids (MIAs), including catharanthine, tabersonine, and vincadifformine. The nuclear genome draft was also useful to develop a highly informative (average polymorphism information content, PIC = 0.62) set of 23 simple sequence repeat (SSR) markers that was validated on the Mandevilla collection. These results were integrated with cytometric measurements, nuclear ITS1 haplotyping and chloroplast DNA barcoding analyses to assess the origin, divergence and relationships existing among the 55 accessions under study. As expected, based on the scarce information available in the literature, the scenario was extremely intricate. A reasonable hypothesis is that most of the accessions represent interspecific hybrids sharing the same species as maternal parent (i.e., Mandevilla sanderi).
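The genome-size figures quoted in this abstract can be cross-checked with simple arithmetic, sketched below. The second function is a deliberately simplistic k-mer estimator (total k-mer count divided by the main coverage peak), included only to illustrate what "predicted based on the K-mer frequency distribution" can mean; it is an assumption, not the authors' procedure, and the toy histogram is invented for the example.

```python
# Figures quoted in the abstract, in megabases (Mb).
triploid_size_mb = 1512.64   # cytometric estimate, triploid
haploid_kmer_mb = 506.0      # haploid size predicted from k-mer frequencies
assembly_mb = 207.0          # draft nuclear assembly length

haploid_cytometry_mb = triploid_size_mb / 3
assembled_fraction = assembly_mb / haploid_kmer_mb

print(f"haploid size from cytometry: {haploid_cytometry_mb:.2f} Mb")
print(f"assembly covers {assembled_fraction:.0%} of the k-mer estimate")


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    This ignores heterozygosity, repeats and polyploidy.
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Hypothetical toy histogram with an error tail at depths 1-2.
toy_histogram = {1: 500, 2: 50, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
toy_size = estimate_genome_size(toy_histogram)
print(f"toy genome size estimate: {toy_size:.0f} bases")
```

The cytometric haploid value (about 504 Mb) and the k-mer prediction (506 Mb) agree closely, which is consistent with the 41% figure reported for the 207 Mb draft.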
Affiliation(s)
- Fabio Palumbo
- Department of Agronomy, Food, Natural Resources, Animals and the Environment, University of Padova, Padua, Italy
- Samela Draga
- Department of Agronomy, Food, Natural Resources, Animals and the Environment, University of Padova, Padua, Italy
- Francesco Scariolo
- Department of Agronomy, Food, Natural Resources, Animals and the Environment, University of Padova, Padua, Italy
- Giovanni Gabelli
- Department of Agronomy, Food, Natural Resources, Animals and the Environment, University of Padova, Padua, Italy
- Gianni Barcaccia
- Department of Agronomy, Food, Natural Resources, Animals and the Environment, University of Padova, Padua, Italy
4. Wang C, Han B. Twenty years of rice genomics research: From sequencing and functional genomics to quantitative genomics. Mol Plant 2022; 15:593-619. [PMID: 35331914 DOI: 10.1016/j.molp.2022.03.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/16/2022] [Revised: 03/04/2022] [Accepted: 03/18/2022] [Indexed: 06/14/2023]
Abstract
Since the completion of the rice genome sequencing project in 2005, we have entered the era of rice genomics, which is still in its ascendancy. Rice genomics studies can be classified into three stages: structural genomics, functional genomics, and quantitative genomics. Structural genomics refers primarily to genome sequencing for the construction of a complete map of rice genome sequence. This is fundamental for rice genetics and molecular biology research. Functional genomics aims to decode the functions of rice genes. Quantitative genomics is large-scale sequence- and statistics-based research to define the quantitative traits and genetic features of rice populations. Rice genomics has been a transformative influence on rice biological research and contributes significantly to rice breeding, making rice a good model plant for studying crop sciences.
Affiliation(s)
- Changsheng Wang
- National Center for Gene Research, State Key Laboratory of Plant Molecular Genetics, Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai 200233, China.
- Bin Han
- National Center for Gene Research, State Key Laboratory of Plant Molecular Genetics, Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai 200233, China.
5. Briscoe L, Balliu B, Sankararaman S, Halperin E, Garud NR. Evaluating supervised and unsupervised background noise correction in human gut microbiome data. PLoS Comput Biol 2022; 18:e1009838. [PMID: 35130266 PMCID: PMC8853548 DOI: 10.1371/journal.pcbi.1009838] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/25/2021] [Revised: 02/17/2022] [Accepted: 01/15/2022] [Indexed: 12/13/2022]
Abstract
The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of the ability of different denoising transformations in combination with supervised correction methods as well as an unsupervised principal component correction approach that is presently used in other domains but has not been applied to microbiome data to date. We find that the unsupervised principal component correction approach has comparable ability in reducing false discovery of biomarkers as the supervised approaches, with the added benefit of not needing to know the sources of variation a priori. However, in prediction tasks, it appears to only improve prediction when technical variables contribute to the majority of variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
The human gut microbiome is known to play a major role in health and is associated with many diseases including colorectal cancer, obesity, and diabetes. The prediction of host phenotypes and identification of biomarkers of disease is essential for harnessing the therapeutic potential of the microbiome. However, many metagenomic datasets are affected by technical variables that introduce unwanted variation that can confound the ability to predict phenotypes and identify biomarkers. Currently, supervised methods originally designed for gene expression and RNA-seq data are commonly applied to microbiome data for correction of background noise, but they are limited in that they cannot correct for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of the ability of different denoising transformations in combination with supervised correction methods as well as an unsupervised principal component correction approach and find that all correction approaches reduce false positives for biomarker discovery. In the task of predicting phenotypes, different approaches have varying success where the unsupervised correction can improve prediction when technical variables contribute to the majority of variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
Affiliation(s)
- Leah Briscoe
- Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, California, United States of America
- * E-mail: (LB); (EH); (NRG)
- Brunilda Balliu
- Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Sriram Sankararaman
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Eran Halperin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Anesthesiology and Perioperative Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Institute of Precision Health, University of California Los Angeles, Los Angeles, California, United States of America
- Nandita R. Garud
- Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, California, United States of America
6. Song JM, Xie WZ, Wang S, Guo YX, Koo DH, Kudrna D, Gong C, Huang Y, Feng JW, Zhang W, Zhou Y, Zuccolo A, Long E, Lee S, Talag J, Zhou R, Zhu XT, Yuan D, Udall J, Xie W, Wing RA, Zhang Q, Poland J, Zhang J, Chen LL. Two gap-free reference genomes and a global view of the centromere architecture in rice. Mol Plant 2021; 14:1757-1767. [PMID: 34171480 DOI: 10.1016/j.molp.2021.06.018] [Citation(s) in RCA: 115] [Impact Index Per Article: 38.3] [Received: 05/08/2021] [Revised: 06/16/2021] [Accepted: 06/22/2021] [Indexed: 05/04/2023]
Abstract
Rice (Oryza sativa), a major staple throughout the world and a model system for plant genomics and breeding, was the first crop genome sequenced almost two decades ago. However, reference genomes for all higher organisms to date contain gaps and missing sequences. Here, we report the assembly and analysis of gap-free reference genome sequences for two elite O. sativa xian/indica rice varieties, Zhenshan 97 and Minghui 63, which are being used as a model system for studying heterosis and yield. Gap-free reference genomes provide the opportunity for a global view of the structure and function of centromeres. We show that all rice centromeric regions share conserved centromere-specific satellite motifs with different copy numbers and structures. In addition, the similarity of CentO repeats in the same chromosome is higher than across chromosomes, supporting a model of local expansion and homogenization. Both genomes have over 395 non-TE genes located in centromere regions, of which ∼41% are actively transcribed. Two large structural variants at the end of chromosome 11 affect the copy number of resistance genes between the two genomes. The availability of the two gap-free genomes lays a solid foundation for further understanding genome structure and function in plants and breeding climate-resilient varieties.
Affiliation(s)
- Jia-Ming Song
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; College of Life Science and Technology, Guangxi University, Nanning 530004, China
- Wen-Zhao Xie
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Shuo Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Yi-Xiong Guo
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Dal-Hoe Koo
- Wheat Genetics Resource Center, Department of Plant Pathology, Kansas State University, Manhattan, KS, USA
- Dave Kudrna
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
- Chenbo Gong
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Yicheng Huang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Jia-Wu Feng
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Wenhui Zhang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Yong Zhou
- Center for Desert Agriculture, Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Andrea Zuccolo
- Center for Desert Agriculture, Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Evan Long
- Plant and Wildlife Science Department, Brigham Young University, Provo, UT 84602, USA
- Seunghee Lee
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
- Jayson Talag
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
- Run Zhou
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Xi-Tong Zhu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Daojun Yuan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Joshua Udall
- Plant and Wildlife Science Department, Brigham Young University, Provo, UT 84602, USA
- Weibo Xie
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Rod A Wing
- Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA; Center for Desert Agriculture, Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia; International Rice Research Institute (IRRI), Strategic Innovation, Los Baños, 4031 Laguna, Philippines
- Qifa Zhang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
- Jesse Poland
- Wheat Genetics Resource Center, Department of Plant Pathology, Kansas State University, Manhattan, KS, USA.
- Jianwei Zhang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China.
- Ling-Ling Chen
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China; College of Life Science and Technology, Guangxi University, Nanning 530004, China.
7. Palumbo F, Vannozzi A, Barcaccia G. Impact of Genomic and Transcriptomic Resources on Apiaceae Crop Breeding Strategies. Int J Mol Sci 2021; 22:9713. [PMID: 34575872 PMCID: PMC8465131 DOI: 10.3390/ijms22189713] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/29/2021] [Revised: 09/03/2021] [Accepted: 09/04/2021] [Indexed: 01/18/2023]
Abstract
The Apiaceae taxon is one of the most important families of flowering plants and includes thousands of species used for food, flavoring, fragrance, medical and industrial purposes. This study had the specific intent of reviewing the main genomics and transcriptomic data available for this family and their use for the constitution of new varieties. This was achieved starting from the description of the main reproductive systems and barriers, with particular reference to cytoplasmic (CMS) and nuclear (NMS) male sterility. We found that CMS and NMS systems have been discovered and successfully exploited for the development of varieties only in Foeniculum vulgare, Daucus carota, Apium graveolens and Pastinaca sativa; whereas, strategies to limit self-pollination have been poorly considered. Since the constitution of new varieties benefits from the synergistic use of marker-assisted breeding in combination with conventional breeding schemes, we also analyzed and discussed the available SNP and SSR marker datasets (20 species) and genomes (8 species). Furthermore, the RNA-seq studies aimed at elucidating key pathways in stress tolerance or biosynthesis of the metabolites of interest were limited and proportional to the economic weight of each species. Finally, by aligning 53 plastid genomes from as many species as possible, we demonstrated the precision offered by the super barcoding approach to reconstruct the phylogenetic relationships of Apiaceae species. Overall, despite the impressive size of this family, we documented an evident lack of molecular data, especially because genomic and transcriptomic resources are circumscribed to a small number of species. We believe that our contribution can help future studies aimed at developing molecular tools for boosting breeding programs in crop plants of the Apiaceae family.
8. Liao X, Li M, Luo J, Zou Y, Wu FX, Luo F, Wang J. EPGA-SC: A Framework for de novo Assembly of Single-Cell Sequencing Reads. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1492-1503. [PMID: 31603794 DOI: 10.1109/tcbb.2019.2945761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Assembling genomes from single-cell sequencing data is essential for single-cell studies. However, single-cell assemblies are challenging due to (i) the highly non-uniform read coverage and (ii) the elevated levels of sequencing errors and chimeric reads. Although several assemblers for single-cell data have been proposed in recent years, most of them fail to construct correct long contigs. In this study, we present a new framework called EPGA-SC for de novo assembly of single-cell sequencing reads. The EPGA assembler includes strategies designed to solve the problems caused by sequencing errors, sequencing biases, and repetitive regions. However, the extremely unbalanced coverage and richer error types prevent EPGA from achieving high performance on single-cell sequencing data. We therefore designed EPGA-SC based on EPGA. The main innovations of EPGA-SC are as follows: (i) classifying reads to reduce the proportion of false reads; (ii) using multiple sets of high precision paired-end reads generated from the high precision assemblies produced by other assemblers such as SPAdes to overcome the impact of sequencing biases and repetitive regions; and (iii) developing novel algorithms for removing chimeric errors and extending contigs. We tested EPGA-SC on seven datasets. The experimental results show that EPGA-SC generates better assemblies than most current tools in most cases in terms of MAX contig, N50, NG50, NA50, and NGA50.
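Two of the contiguity metrics named at the end of this abstract can be sketched in a few lines. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size; NG50 uses half of the (known or estimated) genome size instead. The function name and contig lengths below are illustrative assumptions. NA50 and NGA50 are not shown because they require first breaking contigs at misassemblies found by aligning to a reference.

```python
def nx50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when genome_size is given.

    Contigs are visited from longest to shortest; the answer is the length
    of the contig at which the running total first reaches half of the
    assembly size (N50) or half of genome_size (NG50).
    """
    lengths = sorted(contig_lengths, reverse=True)
    reference = genome_size if genome_size is not None else sum(lengths)
    target = reference / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= target:
            return length
    return 0  # contigs cover less than half of genome_size


# Hypothetical contig lengths, used only for illustration.
contigs = [80, 70, 50, 40, 30, 20]
print("N50 :", nx50(contigs))
print("NG50:", nx50(contigs, genome_size=400))
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total length to be compared on the same scale.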
9. Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022]
Abstract
Background: K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences.
Results: The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes.
Conclusion: We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
Affiliation(s)
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China; School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
- Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.
- Yun Jia
- College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
- Yan Zheng
- Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
- Hu Meng
- School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
- Tonglaga Bao
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- Liaofu Luo
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
10. Jaworski CC, Allan CW, Matzkin LM. Chromosome-level hybrid de novo genome assemblies as an attainable option for nonmodel insects. Mol Ecol Resour 2020; 20:1277-1293. [DOI: 10.1111/1755-0998.13176] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/24/2019] [Revised: 03/31/2020] [Accepted: 04/16/2020] [Indexed: 11/27/2022]
Affiliation(s)
- Coline C. Jaworski
- Department of Entomology, The University of Arizona, Tucson, AZ, USA
- Univ Avignon, CNRS, IRD, IMBE, Aix Marseille Université, Marseille, France
- Department of Zoology, University of Oxford, Oxford, UK
- Carson W. Allan
- Department of Entomology, The University of Arizona, Tucson, AZ, USA
- Luciano M. Matzkin
- Department of Entomology, The University of Arizona, Tucson, AZ, USA
- BIO5 Institute, The University of Arizona, Tucson, AZ, USA
- Department of Ecology and Evolutionary Biology, The University of Arizona, Tucson, AZ, USA
11. Taylor WS, Pearson J, Miller A, Schmeier S, Frizelle FA, Purcell RV. MinION Sequencing of colorectal cancer tumour microbiomes-A comparison with amplicon-based and RNA-Sequencing. PLoS One 2020; 15:e0233170. [PMID: 32433701 PMCID: PMC7239435 DOI: 10.1371/journal.pone.0233170] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/20/2019] [Accepted: 04/29/2020] [Indexed: 01/26/2023]
Abstract
Background: Recent evidence suggests a role for the gut microbiome in the development and progression of many diseases and many studies have been carried out to analyse the microbiome using a variety of methods. In this study, we compare MinION sequencing with meta-transcriptomics and amplicon-based sequencing for microbiome analysis of colorectal tumour tissue samples.
Methods: DNA and RNA were extracted from 11 colorectal tumour samples. 16S rRNA amplicon sequencing and MinION sequencing were carried out using genomic DNA, and RNA-Sequencing for meta-transcriptomic analysis. Non-human MinION and RNA-Sequencing reads, and 16S rRNA amplicon sequencing reads were taxonomically classified using a database built from available RefSeq bacterial and archaeal genomes and a k-mer based algorithm in Kraken2. Concordance between the three platforms at different taxonomic levels was tested on a per-sample basis using Spearman's rank correlation.
Results: The average number of reads per sample using RNA-Sequencing was greater than 129 times that generated using MinION sequencing. However, the average read length of MinION sequences was more than 13 times that of RNA or 16S rRNA amplicon sequencing. Taxonomic assignment using 16S sequencing was less reliable beyond the genus level, and both RNA-Sequencing and MinION sequencing could detect greater numbers of phyla and genera in the same samples, compared to 16S sequencing. Bacterial species associated with colorectal cancer, Fusobacterium nucleatum, Parvimonas micra, Bacteroides fragilis and Porphyromonas gingivalis, were detectable using MinION, RNA-Sequencing and 16S rRNA amplicon sequencing data.
Conclusions: Long-read sequences generated using MinION sequencing can compensate for low numbers of reads for bacterial classification. MinION sequencing can discriminate between bacterial strains and plasmids and shows potential as a cost-effective tool for rapid microbiome sequencing in a clinical setting.
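The per-sample concordance test mentioned in the methods can be illustrated with a small, dependency-free Spearman's rank correlation: rank the taxon abundances from each platform (ties receive the average rank) and take the Pearson correlation of the ranks. The function names and read counts below are invented for illustration and do not come from the study; in practice a statistics library would normally be used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            ranks[index] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical read counts for five genera in one sample on two platforms.
minion_counts = [120, 45, 30, 5, 0]
rnaseq_counts = [9000, 5200, 800, 1500, 10]
rho = spearman(minion_counts, rnaseq_counts)
print(f"Spearman rho = {rho:.2f}")
```

Working on ranks rather than raw counts is what makes the comparison tolerant of the large difference in sequencing depth between platforms.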
Affiliation(s)
- William S. Taylor
  - Department of Surgery, University of Otago, Christchurch, New Zealand
- John Pearson
  - Biostatistics and Computational Biology Unit, University of Otago, Christchurch, New Zealand
- Allison Miller
  - Gene Structure and Function Laboratory, University of Otago, Christchurch, New Zealand
- Sebastian Schmeier
  - Institute of Natural and Mathematical Sciences, Massey University, Auckland, New Zealand
- Frank A. Frizelle
  - Department of Surgery, University of Otago, Christchurch, New Zealand
- Rachel V. Purcell
  - Department of Surgery, University of Otago, Christchurch, New Zealand
|
12
|
Li J, Huang Y, Zhou Y. A Mini-review of the Computational Methods Used in Identifying RNA 5-Methylcytosine Sites. Curr Genomics 2020; 21:3-10. [PMID: 32655293 PMCID: PMC7324889 DOI: 10.2174/2213346107666200219124951] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/24/2019] [Revised: 01/17/2020] [Accepted: 01/31/2020] [Indexed: 01/10/2023] Open
Abstract
RNA 5-methylcytosine (m5C) is one of the pillars of post-transcriptional modification (PTCM). A growing body of evidence suggests that m5C plays a vital role in RNA metabolism. Accurate localization of RNA m5C sites in tissue cells is the premise and basis for the in-depth understanding of the functions of m5C. However, the main experimental methods of detecting m5C sites are limited to varying degrees. Establishing a computational model to predict modification sites is an excellent complement to wet experiments for identifying m5C sites. In this review, we summarized some available m5C predictors and discussed the characteristics of these methods.
Affiliation(s)
- Jianwei Li
  - Institute of Computational Medicine, School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Center for Noncoding RNA Medicine, Peking University, Beijing, China
- Yan Huang
  - Institute of Computational Medicine, School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Center for Noncoding RNA Medicine, Peking University, Beijing, China
- Yuan Zhou
  - Institute of Computational Medicine, School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Center for Noncoding RNA Medicine, Peking University, Beijing, China
|
13
|
Liao X, Li M, Luo J, Zou Y, Wu FX, Pan Y, Luo F, Wang J. Improving de novo Assembly Based on Read Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:177-188. [PMID: 30059317 DOI: 10.1109/tcbb.2018.2861380] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 06/08/2023]
Abstract
Due to sequencing bias, sequencing errors, and repeats, genome assemblies usually contain misarrangements and gaps. When tackling these problems, current assemblers commonly treat the read libraries as a whole and apply the same strategy to all reads. However, if reads are divided into different categories and each category is assembled with its own strategy, we expect to reduce the interference between these problems and to produce more satisfactory assemblies. In this paper, we present a new pipeline for genome assembly based on read classification (ARC). ARC classifies reads into three categories according to the frequencies of the k-mers they contain: (1) low depth reads, which contain a certain number of low frequency k-mers and are often caused by sequencing errors or bias; (2) high depth reads, which contain a certain number of high frequency k-mers and usually come from repetitive regions; and (3) normal depth reads, which are the remaining reads. After read classification, an existing assembler is used to assemble the read categories separately, which helps resolve problems in the genome assembly. ARC adopts loose assembly parameters for low depth reads and strict assembly parameters for normal depth and high depth reads. We test ARC using five datasets. The experimental results show that assemblers combined with ARC can generate better assemblies in terms of NA50, NGA50, and genome fraction.
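The three-way read classification described in this abstract can be sketched as follows. The abstract does not give ARC's actual thresholds or decision rule, so the parameters `low_max`, `high_min`, and `min_fraction`, and the choice to test the low-depth condition first, are illustrative assumptions rather than the published method.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify_reads(reads, k, low_max, high_min, min_fraction=0.5):
    """Label each read 'low', 'high', or 'normal' from its k-mer frequencies.

    Assumed rule (not taken from the paper): a read is 'low' if at least
    min_fraction of its k-mers occur <= low_max times in the whole read set,
    'high' if at least min_fraction occur >= high_min times, else 'normal'.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    labels = []
    for read in reads:
        freqs = [counts[km] for km in kmers(read, k)]
        if not freqs:
            # Reads shorter than k carry no k-mer evidence; call them normal.
            labels.append("normal")
            continue
        low = sum(f <= low_max for f in freqs) / len(freqs)
        high = sum(f >= high_min for f in freqs) / len(freqs)
        if low >= min_fraction:
            labels.append("low")
        elif high >= min_fraction:
            labels.append("high")
        else:
            labels.append("normal")
    return labels


# Toy reads: three identical reads, one homopolymer repeat, one unique read.
reads = ["ACGTA", "ACGTA", "ACGTA", "GGGGGGGG", "ACCTA"]
labels = classify_reads(reads, k=3, low_max=1, high_min=5)
```

In the toy input, the k-mers of the unique read occur once each, the repeat read contributes one k-mer six times, and the three identical reads fall between the two thresholds, so each read lands in a different category for a different reason.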
|
14
|
Jung H, Winefield C, Bombarely A, Prentis P, Waterhouse P. Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes. TRENDS IN PLANT SCIENCE 2019; 24:700-724. [PMID: 31208890 DOI: 10.1016/j.tplants.2019.05.003] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 01/06/2019] [Revised: 05/01/2019] [Accepted: 05/10/2019] [Indexed: 05/16/2023]
Abstract
The commercial release of third-generation sequencing technologies (TGSTs), giving long and ultra-long sequencing reads, has stimulated the development of new tools for assembling highly contiguous genome sequences with unprecedented accuracy across complex repeat regions. We survey here a wide range of emerging sequencing platforms and analytical tools for de novo assembly, provide background information for each of their steps, and discuss the spectrum of available options. Our decision tree recommends workflows for the generation of a high-quality genome assembly when used in combination with the specific needs and resources of a project.
Affiliation(s)
- Hyungtaek Jung
  - Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia.
- Christopher Winefield
  - Department of Wine, Food, and Molecular Biosciences, Lincoln University, 7647 Christchurch, New Zealand
- Aureliano Bombarely
  - Department of Bioscience, University of Milan, Milan 20133, Italy; School of Plants and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Peter Prentis
  - School of Earth, Environmental, and Biological Sciences, Queensland University of Technology, Brisbane, QLD, 4001, Australia
- Peter Waterhouse
  - Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia; School of Biological Sciences, University of Sydney, Sydney, NSW 2006, Australia.
|
15
|
A New Portrait of Constitutive Heterochromatin: Lessons from Drosophila melanogaster. Trends Genet 2019; 35:615-631. [PMID: 31320181 DOI: 10.1016/j.tig.2019.06.002] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 02/04/2019] [Revised: 06/05/2019] [Accepted: 06/06/2019] [Indexed: 12/14/2022]
Abstract
Constitutive heterochromatin represents a significant portion of eukaryotic genomes, but its functions still need to be elucidated. Even in the most updated genetics and molecular biology textbooks, constitutive heterochromatin is portrayed mainly as the 'silent' component of eukaryotic genomes. However, there may be more complexity to the relationship between heterochromatin and gene expression. In the fruit fly Drosophila melanogaster, a model for heterochromatin studies, about one-third of the genome is heterochromatic and is concentrated in the centric, pericentric, and telomeric regions of the chromosomes. Recent findings indicate that hundreds of D. melanogaster genes can 'live and work' properly within constitutive heterochromatin. The genomic size of these genes is generally larger than that of euchromatic genes, and together they account for a significant fraction of the entire constitutive heterochromatin. Thus, this peculiar genome component, in spite of its ability to induce silencing, has in fact the means for being quite dynamic. A major aim of this review is to revisit the 'dogma of silent heterochromatin'.
|
16
|
Zhang H, Li Y, Zhu JK. Developing naturally stress-resistant crops for a sustainable agriculture. NATURE PLANTS 2018; 4:989-996. [PMID: 30478360 DOI: 10.1038/s41477-018-0309-4] [Citation(s) in RCA: 110] [Impact Index Per Article: 18.3] [Received: 12/12/2017] [Accepted: 10/17/2018] [Indexed: 05/19/2023]
Abstract
A major problem facing humanity is that our numbers are growing but the availability of land and fresh water for agriculture is not. This problem is being exacerbated by climate change-induced increases in drought, and other abiotic stresses. Stress-resistant crops are needed to ensure yield stability under stress conditions and to minimize the environmental impacts of crop production. Evolution has created thousands of species of naturally stress-resistant plants (NSRPs), some of which have already been subjected to human domestication and are considered minor crops. Broader cultivation of these minor crops will diversify plant agriculture and the human diet, and will therefore help improve global food security and human health. More research should be directed toward understanding and utilizing NSRPs. Technologies are now available that will enable researchers to rapidly improve the genetics of NSRPs, with the goal of increasing NSRP productivity while retaining NSRP stress resistance and nutritional value.
Affiliation(s)
- Heng Zhang
  - Shanghai Center for Plant Stress Biology, Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, China.
  - National Key Laboratory of Plant Molecular Genetics, Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, China.
- Yuanyuan Li
  - Key Laboratory of Plant Stress Research, Shandong Normal University, Jinan, Shandong, China
- Jian-Kang Zhu
  - Shanghai Center for Plant Stress Biology, Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, China.
  - Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN, USA.
|
17
|
Kaisers W, Schwender H, Schaal H. Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities. Int J Mol Sci 2018; 19:E3687. [PMID: 30469355 PMCID: PMC6274891 DOI: 10.3390/ijms19113687] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/05/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023] Open
Abstract
We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, readily applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicates the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
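The idea of clustering samples by their k-mer counts can be sketched in a few lines. This is a Python stand-in, not the seqTools R API: the function names are hypothetical, and the Manhattan distance on normalised k-mer frequencies and the average-linkage rule are assumptions chosen for brevity, not necessarily what the package uses.

```python
from collections import Counter
from itertools import product


def kmer_profile(reads, k):
    """Relative frequency of every DNA k-mer across all reads of a sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values()) or 1
    # Fixed k-mer order so that profiles from different samples align.
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def manhattan(p, q):
    """Sum of absolute frequency differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they occur.

    Each merge is a pair of clusters, and each cluster is a tuple of sample
    indices into `profiles`.
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(manhattan(profiles[i], profiles[j]) for i, j in pairs)
                d /= len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for n, c in enumerate(clusters) if n not in (a, b)]
        clusters.append(merged)
    return merges


# Four toy samples: two A-rich and two CG-rich, one read each.
samples = [["AAAAAA"], ["AAAAAT"], ["CGCGCG"], ["CGCGCC"]]
profiles = [kmer_profile(s, k=2) for s in samples]
merges = average_linkage(profiles)
```

Normalising counts to frequencies keeps samples of different sequencing depth comparable. In a real analysis, samples that group by preparation batch rather than by experimental condition would be the warning sign the abstract describes.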
Affiliation(s)
- Wolfgang Kaisers
  - Department of Anaesthesiology, HELIOS University Hospital Wuppertal, University of Witten/Herdecke, Heusnerstr. 40, 42283 Wuppertal, Germany.
  - Institut fur Virologie, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
- Holger Schwender
  - Mathematisches Institut, Heinrich-Heine-Universität Düsseldorf, 40225 Düsseldorf, Germany.
- Heiner Schaal
  - Institut fur Virologie, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
|
18
|
Heterochromatin-Enriched Assemblies Reveal the Sequence and Organization of the Drosophila melanogaster Y Chromosome. Genetics 2018; 211:333-348. [PMID: 30420487 PMCID: PMC6325706 DOI: 10.1534/genetics.118.301765] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 07/04/2018] [Accepted: 11/05/2018] [Indexed: 12/21/2022] Open
Abstract
Heterochromatic regions of the genome are repeat-rich and poor in protein coding genes, and are therefore underrepresented in even the best genome assemblies. Sex-limited chromosomes are among the most difficult regions of the genome to assemble. The Drosophila melanogaster Y chromosome is entirely heterochromatic, yet has wide-ranging effects on male fertility, fitness, and genome-wide gene expression. The genetic basis of this phenotypic variation is difficult to study, in part because we do not know the detailed organization of the Y chromosome. To study Y chromosome organization in D. melanogaster, we develop an assembly strategy involving the in silico enrichment of heterochromatic long single-molecule reads and use these reads to create targeted de novo assemblies of heterochromatic sequences. We assigned contigs to the Y chromosome using Illumina reads to identify male-specific sequences. Our pipeline extends the D. melanogaster reference genome by 11.9 Mb, closes 43.8% of the gaps, and improves overall contiguity. The addition of 10.6 Mb of Y-linked sequence permitted us to study the organization of repeats and genes along the Y chromosome. We detected a high rate of duplication to the pericentric regions of the Y chromosome from other regions in the genome. Most of these duplicated genes exist in multiple copies. We detail the evolutionary history of one sex-linked gene family, crystal-Stellate. While the Y chromosome does not undergo crossing over, we observed high gene conversion rates within and between members of the crystal-Stellate gene family, Su(Ste), and PCKR, compared to genome-wide estimates. Our results suggest that gene conversion and gene duplication play an important role in the evolution of Y-linked genes.
|
19
|
Subirana JA, Messeguer X. How Long Are Long Tandem Repeats? A Challenge for Current Methods of Whole-Genome Sequence Assembly: The Case of Satellites in Caenorhabditis elegans. Genes (Basel) 2018; 9:genes9100500. [PMID: 30332836 PMCID: PMC6210790 DOI: 10.3390/genes9100500] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/18/2018] [Revised: 10/11/2018] [Accepted: 10/11/2018] [Indexed: 11/16/2022] Open
Abstract
Repetitive genome regions have been difficult to sequence, mainly because of the comparatively small size of the fragments used in assembly. Satellites or tandem repeats are very abundant in nematodes and offer an excellent playground to evaluate different assembly methods. Here, we compare the structure of satellites found in three different assemblies of the Caenorhabditis elegans genome: the original sequence obtained by Sanger sequencing, an assembly based on PacBio technology, and an assembly using Nanopore sequencing reads. In general, satellites were found in equivalent genomic regions, but the new long-read methods (PacBio and Nanopore) tended to result in longer assembled satellites. Important differences exist between the assemblies resulting from the two long-read technologies, such as the sizes of long satellites. Our results also suggest that the lengths of some annotated genes with internal repeats which were assembled using Sanger sequencing are likely to be incorrect.
Affiliation(s)
- Juan A Subirana
  - Department of Computer Science, Universitat Politècnica de Catalunya, Jordi Girona 1-3, 08034 Barcelona, Spain.
  - Evolutionary Genomics Group, Research Program on Biomedical Informatics (GRIB)-Hospital del Mar Research Institute (IMIM), Universitat Pompeu Fabra (UPF), Dr. Aiguader 86, 08003 Barcelona, Spain.
- Xavier Messeguer
  - Department of Computer Science, Universitat Politècnica de Catalunya, Jordi Girona 1-3, 08034 Barcelona, Spain.
|
20
|
Choudhury O, Chakrabarty A, Emrich SJ. HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning. Sci Rep 2018; 8:9936. [PMID: 29967328 PMCID: PMC6028576 DOI: 10.1038/s41598-018-28364-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/29/2018] [Accepted: 05/31/2018] [Indexed: 11/22/2022] Open
Abstract
Second-generation DNA sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span complex and repetitive regions. However, the usefulness of such long reads is limited because of high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse, real-world data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. Additionally, we provide an optional avenue of improving the performance of HECIL's core algorithm by introducing an iterative learning paradigm that enhances the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.
Affiliation(s)
- Olivia Choudhury
  - Postdoctoral Researcher, IBM Research, Cambridge, MA, 02142, USA.
- Ankush Chakrabarty
  - Visiting Research Scientist, Mitsubishi Electric Research Laboratories, Cambridge, MA, 02139, USA
- Scott J Emrich
  - Associate Professor, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, 37996, USA
|
21
|
Zhao X. Validation of Genomic Structural Variants Through Long Sequencing Technologies. Methods Mol Biol 2018; 1833:187-192. [PMID: 30039374 DOI: 10.1007/978-1-4939-8666-8_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Although numerous algorithms have been developed to identify large chromosomal rearrangements (i.e., genomic structural variants, SVs), there remains a dearth of approaches to evaluate their results. This is significant, as the accurate identification of SVs is still an outstanding problem: no single algorithm has been shown to achieve high sensitivity and specificity across different classes of SVs. The method introduced in this chapter, VaPoR, is specifically designed to evaluate the accuracy of SV predictions using third-generation long sequences. This method uses a recurrence approach and collects direct evidence from raw reads, thus avoiding computationally costly whole genome assembly. This chapter describes in detail how to apply this tool to different data types.
Affiliation(s)
- Xuefang Zhao
  - Center for Genomic Medicine at Massachusetts General Hospital, Boston, MA, USA.
|
22
|
Zhao X, Weber AM, Mills RE. A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 2017; 6:1-9. [PMID: 28873962 PMCID: PMC5737365 DOI: 10.1093/gigascience/gix061] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 03/27/2017] [Revised: 05/15/2017] [Accepted: 07/09/2017] [Indexed: 11/23/2022] Open
Abstract
Although numerous algorithms have been developed to identify structural variations (SVs) in genomic sequences, there is a dearth of approaches that can be used to evaluate their results. This is significant as the accurate identification of structural variation is still an outstanding but important problem in genomics. The emergence of new sequencing technologies that generate longer sequence reads can, in theory, provide direct evidence for all types of SVs regardless of the length of the region they span. However, current efforts to use these data in this manner require large computational resources to assemble these sequences as well as visual inspection of each region. Here we present VaPoR, a highly efficient algorithm that autonomously validates large SV sets using long-read sequencing data. We assessed the performance of VaPoR on SVs in both simulated and real genomes and report a high-fidelity rate for overall accuracy across different levels of sequence depth. We show that VaPoR can interrogate a much larger range of SVs while still matching existing methods in terms of false positive validations and providing additional features such as breakpoint precision and predicted genotype. We further show that VaPoR can run quickly and efficiently without requiring a large processing or assembly pipeline. VaPoR provides a long read-based validation approach for genomic SVs that requires relatively low read depth and computing resources and thus will provide utility with targeted or low-pass sequencing coverage for accurate SV assessment. The VaPoR software is available at: https://github.com/mills-lab/vapor.
Affiliation(s)
- Xuefang Zhao
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109, USA
- Alexandra M. Weber
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109, USA
- Ryan E. Mills
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109, USA
  - Department of Human Genetics, University of Michigan, 1241 Catherine St, Ann Arbor, MI 48109, USA
|
23
|
Pandey P, Bender MA, Johnson R, Patro R. deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 2017; 33:i133-i141. [PMID: 28881995 PMCID: PMC5870571 DOI: 10.1093/bioinformatics/btx261] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn Graph of the reads they are given as input. Even when other approaches are used for subsequent assembly (e.g. when one is using 'long read' technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly, and state-of-the-art long-read error-correction methods use de Bruijn Graphs. Because of the centrality of de Bruijn Graphs, researchers have proposed numerous methods for representing de Bruijn Graphs compactly. Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e. the number of times that each k-mer occurs, which is key in transcriptome assemblers. RESULTS: We present a method for compactly representing the weighted de Bruijn Graph (i.e. with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by less than 18-28% compared to the approximate de Bruijn graph representation in Squeakr. Our technique is based on a simple invariant that all weighted de Bruijn Graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn Graph-based systems. AVAILABILITY AND IMPLEMENTATION: https://github.com/splatlab/debgr CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
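To make the notion of a weighted de Bruijn graph and a count-based invariant concrete, the sketch below builds k-mer abundances from reads and checks a flow-conservation property at every (k-1)-mer node. The abstract does not spell out the invariant deBGR relies on, so the property tested here is an assumption for illustration; the function names are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def weighted_dbg(reads, k):
    """Count k-mers (edges) and the (k-1)-mers at which reads start and end."""
    if k < 2:
        raise ValueError("k must be at least 2")
    edges, starts, ends = Counter(), Counter(), Counter()
    for read in reads:
        if len(read) < k:
            continue  # too short to contribute an edge
        starts[read[:k - 1]] += 1
        ends[read[-(k - 1):]] += 1
        for i in range(len(read) - k + 1):
            edges[read[i:i + k]] += 1
    return edges, starts, ends


def conservation_violations(edges, starts, ends):
    """Return nodes where incoming weight + read starts != outgoing + read ends.

    Every occurrence of a node inside a read is either entered by an edge or
    is the read's first node, and is either left by an edge or is the read's
    last node, so exact counts always balance.
    """
    inflow, outflow = Counter(), Counter()
    for kmer, weight in edges.items():
        outflow[kmer[:-1]] += weight  # edge leaves its (k-1)-mer prefix
        inflow[kmer[1:]] += weight    # edge enters its (k-1)-mer suffix
    nodes = set(inflow) | set(outflow)
    return {n for n in nodes if inflow[n] + starts[n] != outflow[n] + ends[n]}


reads = ["ACGTAC", "CGTACG"]
edges, starts, ends = weighted_dbg(reads, k=3)
exact = conservation_violations(edges, starts, ends)

# Simulate an approximate counter that overestimates one k-mer by one.
noisy = Counter(edges)
noisy["CGT"] += 1
flagged = conservation_violations(noisy, starts, ends)
```

With exact counts no node is flagged, while the single overestimated k-mer unbalances both of its endpoint nodes. This shows how a balance condition of this kind can expose errors introduced by an approximate counting structure.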
Affiliation(s)
- Prashant Pandey
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Michael A Bender
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Johnson
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
  - VMWare, Inc., Palo Alto, CA
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
|