1. Yang C, Yang Y, Chu G, Wang R, Li H, Mao Y, Wang M, Zhang J, Liao X, Ma H. AutoESDCas: A Web-Based Tool for the Whole-Workflow Editing Sequence Design for Microbial Genome Editing Based on the CRISPR/Cas System. ACS Synth Biol 2024; 13:1737-1749. [PMID: 38845097 DOI: 10.1021/acssynbio.4c00063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/22/2024]
Abstract
Genome editing is the basis for the modification of engineered microbes. In the process of genome editing, the design of editing sequences, such as primers and sgRNA, is very important for the accurate positioning of editing sites and efficient sequence editing. The whole process of genome editing involves multiple rounds and types of editing sequence design, while the development of related whole-workflow design tools for high-throughput experimental requirements lags. Here, we propose AutoESDCas, an online tool for the end-to-end editing sequence design for microbial genome editing based on the CRISPR/Cas system. This tool facilitates all types of genetic manipulation covering diverse experimental requirements and design scenarios, enables biologists to quickly and efficiently obtain all editing sequences needed for the entire genome editing process, and empowers high-throughput strain modification. Notably, with its off-target risk assessment function for editing sequences, the usability of the design results is significantly improved. AutoESDCas is freely available at https://autoesdcas.biodesign.ac.cn/ with the source code at https://github.com/tibbdc/AutoESDCas/.
Affiliation(s)
- Chunhe Yang
  - College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Yi Yang
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Guangyun Chu
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - College of Food Science and Engineering, Tianjin University of Science and Technology, Tianjin 300457, China
- Ruoyu Wang
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Haoran Li
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Yufeng Mao
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Meng Wang
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Jian Zhang
  - College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China
- Xiaoping Liao
  - Haihe Laboratory of Synthetic Biology, Tianjin 300308, China
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
- Hongwu Ma
  - Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
  - National Technology Innovation Center of Synthetic Biology, Tianjin 300308, China
2. Mokhtar MM, Alsamman AM, El Allali A. MegaSSR: a web server for large scale microsatellite identification, classification, and marker development. Front Plant Sci 2023; 14:1219055. [PMID: 38162302 PMCID: PMC10757629 DOI: 10.3389/fpls.2023.1219055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Accepted: 08/18/2023] [Indexed: 01/03/2024]
Abstract
Next-generation sequencing technologies have opened new avenues for using genomic data to study and develop molecular markers and improve genetic resources. Simple Sequence Repeats (SSRs) as genetic markers are increasingly used in molecular diversity and molecular breeding programs that require bioinformatics pipelines to analyze the large amounts of data. Therefore, there is an ongoing need for online tools that provide computational resources with minimal effort and maximum efficiency, including automated development of SSR markers. These tools should be flexible, customizable, and able to handle the ever-increasing amount of genomic data. Here we introduce MegaSSR (https://bioinformatics.um6p.ma/MegaSSR), a web server and a standalone pipeline that enables the design of SSR markers in any target genome. MegaSSR allows users to design targeted PCR-based primers for their selected SSR repeats and includes multiple tools that initiate computational pipelines for SSR mining, classification, comparisons, PCR primer design, in silico PCR validation, and statistical visualization. MegaSSR results can be accessed, searched, downloaded, and visualized with user-friendly web-based tools. These tools provide graphs and tables showing various aspects of SSR markers and corresponding PCR primers. MegaSSR will accelerate ongoing research in plant species and assist breeding programs in their efforts to improve current genomic resources.
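As a rough illustration of the SSR mining step mentioned in this abstract, the sketch below finds perfect tandem repeats with a regular expression. It is not MegaSSR's implementation; the function name `find_ssrs` and the thresholds (unit length 2 to 6, at least 3 copies) are assumptions chosen only for the example.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Hypothetical helper for illustration only. Limitation: a homopolymer
    run such as AAAAAA is reported with the unit "AA".
    """
    seq = seq.upper()
    # Group 1 captures a candidate repeat unit (shortest first);
    # \1{n,} then requires at least n further copies right after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

print(find_ssrs("TTACACACACGG"))
```

A production pipeline would also handle imperfect and compound repeats and would extract flanking sequence for primer design, which this sketch leaves out.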
Affiliation(s)
- Morad M. Mokhtar
  - Bioinformatics Laboratory, College of Computing, Mohammed VI Polytechnic University, Benguerir, Morocco
  - Agricultural Genetic Engineering Research Institute, Agricultural Research Center, Giza, Egypt
- Alsamman M. Alsamman
  - Bioinformatics Laboratory, College of Computing, Mohammed VI Polytechnic University, Benguerir, Morocco
  - Agricultural Genetic Engineering Research Institute, Agricultural Research Center, Giza, Egypt
  - Biotechnology Department, International Center for Agricultural Research in the Dry Areas (ICARDA), Giza, Egypt
- Achraf El Allali
  - Bioinformatics Laboratory, College of Computing, Mohammed VI Polytechnic University, Benguerir, Morocco
3. Liao X, Zhu W, Zhou J, Li H, Xu X, Zhang B, Gao X. Repetitive DNA sequence detection and its role in the human genome. Commun Biol 2023; 6:954. [PMID: 37726397 PMCID: PMC10509279 DOI: 10.1038/s42003-023-05322-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/29/2023] [Accepted: 09/04/2023] [Indexed: 09/21/2023]
Abstract
Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarize the definition, arrangement, and structural characteristics of repeats. We also introduce the diverse biological functions of repeats and review existing methods for automatic repeat detection, classification, and masking. Finally, we analyze the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of their association with human diseases.
Affiliation(s)
- Xingyu Liao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Wufei Zhu
  - Department of Endocrinology, Yichang Central People's Hospital, The First College of Clinical Medical Science, China Three Gorges University, 443000, Yichang, P.R. China
- Juexiao Zhou
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Haoyang Li
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Xiaopeng Xu
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Bin Zhang
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
4. Wilton R, Szalay AS. Short-read aligner performance in germline variant identification. Bioinformatics 2023; 39:btad480. [PMID: 37527006 PMCID: PMC10421969 DOI: 10.1093/bioinformatics/btad480] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/07/2023] [Revised: 06/01/2023] [Accepted: 07/31/2023] [Indexed: 08/03/2023]
Abstract
MOTIVATION Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools. RESULTS In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners (BWA-MEM, Bowtie 2, and Arioc) in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.
Affiliation(s)
- Richard Wilton
  - Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
- Alexander S Szalay
  - Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States
5. Zürcher JF, Kleefeldt AA, Funke LFH, Birnbaum J, Fredens J, Grazioli S, Liu KC, Spinck M, Petris G, Murat P, Rehm FBH, Sale JE, Chin JW. Continuous synthesis of E. coli genome sections and Mb-scale human DNA assembly. Nature 2023; 619:555-562. [PMID: 37380776 PMCID: PMC7614783 DOI: 10.1038/s41586-023-06268-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 07/20/2022] [Accepted: 05/26/2023] [Indexed: 06/30/2023]
Abstract
Whole-genome synthesis provides a powerful approach for understanding and expanding organism function [1-3]. To build large genomes rapidly, scalably and in parallel, we need (1) methods for assembling megabases of DNA from shorter precursors and (2) strategies for rapidly and scalably replacing the genomic DNA of organisms with synthetic DNA. Here we develop bacterial artificial chromosome (BAC) stepwise insertion synthesis (BASIS), a method for megabase-scale assembly of DNA in Escherichia coli episomes. We used BASIS to assemble 1.1 Mb of human DNA containing numerous exons, introns, repetitive sequences, G-quadruplexes, and long and short interspersed nuclear elements (LINEs and SINEs). BASIS provides a powerful platform for building synthetic genomes for diverse organisms. We also developed continuous genome synthesis (CGS), a method for continuously replacing sequential 100 kb stretches of the E. coli genome with synthetic DNA; CGS minimizes crossovers [1,4] between the synthetic DNA and the genome such that the output for each 100 kb replacement provides, without sequencing, the input for the next 100 kb replacement. Using CGS, we synthesized a 0.5 Mb section of the E. coli genome, a key intermediate in its total synthesis [1], from five episomes in 10 days. By parallelizing CGS and combining it with rapid oligonucleotide synthesis and episome assembly [5,6], along with rapid methods for compiling a single genome from strains bearing distinct synthetic genome sections [1,7,8], we anticipate that it will be possible to synthesize entire E. coli genomes from functional designs in less than 2 months.
Affiliation(s)
- Jérôme F Zürcher
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Askar A Kleefeldt
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Louise F H Funke
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
  - Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
- Jakob Birnbaum
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Julius Fredens
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
  - Synthetic Biology for Clinical and Technological Innovation, Department of Biochemistry, National University of Singapore, Singapore, Singapore
- Simona Grazioli
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Kim C Liu
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Martin Spinck
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Gianluca Petris
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
  - Wellcome Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
- Pierre Murat
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Fabian B H Rehm
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Julian E Sale
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
- Jason W Chin
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
6. Bello L, Wiedenhöft J, Schliep A. Compressed computations using wavelets for hidden Markov models with continuous observations. PLoS One 2023; 18:e0286074. [PMID: 37279196 DOI: 10.1371/journal.pone.0286074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 05/09/2023] [Indexed: 06/08/2023]
Abstract
Compression as an accelerant of computation is increasingly recognized as an important component in engineering fast real-world machine learning methods for big data; cf. its impact on genome-scale approximate string matching. Previous work showed that compression can accelerate algorithms for Hidden Markov Models (HMM) with discrete observations, both for the classical frequentist HMM algorithms (Forward Filtering, Backward Smoothing and Viterbi) and for Gibbs sampling for Bayesian HMM. For Bayesian HMM with continuous-valued observations, compression was shown to greatly accelerate computations for specific types of data. For instance, data from large-scale experiments interrogating structural genetic variation can be assumed to be piece-wise constant with noise, or, equivalently, data generated by HMM with dominant self-transition probabilities. Here we extend the compressive computation approach to the classical frequentist HMM algorithms on continuous-valued observations, providing the first compressive approach for this problem. In a large-scale simulation study, we demonstrate empirically that in many settings compressed HMM algorithms very clearly outperform the classical algorithms with no, or only an insignificant, effect on the computed probabilities and inferred state paths of maximal likelihood. This provides an efficient approach to big data computations with HMM. An open-source implementation of the method is available from https://github.com/lucabello/wavelet-hmms.
Affiliation(s)
- Luca Bello
  - Computer Science and Engineering, University of Gothenburg, Chalmers, Gothenburg, Sweden
- John Wiedenhöft
  - Scientific Core Facility Medical Biometry and Statistical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
- Alexander Schliep
  - Computer Science and Engineering, University of Gothenburg, Chalmers, Gothenburg, Sweden
  - Faculty of Health Sciences, B-TU Cottbus-Senftenberg, Cottbus, Germany
7. Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, Birol I. Linear time complexity de novo long read genome assembly with GoldRush. Nat Commun 2023; 14:2906. [PMID: 37217507 DOI: 10.1038/s41467-023-38716-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/10/2022] [Accepted: 05/11/2023] [Indexed: 05/24/2023]
Abstract
Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap, its most costly step, was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, forgoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
Affiliation(s)
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolić
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Puneet Sidhu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
8. Ilan Y. Constrained disorder principle-based variability is fundamental for biological processes: Beyond biological relativity and physiological regulatory networks. Prog Biophys Mol Biol 2023; 180-181:37-48. [PMID: 37068713 DOI: 10.1016/j.pbiomolbio.2023.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Revised: 03/26/2023] [Accepted: 04/14/2023] [Indexed: 04/19/2023]
Abstract
The constrained disorder principle (CDP) defines systems based on their degree of disorder bounded by dynamic boundaries. The principle explains stochasticity in living and non-living systems. Denis Noble described the importance of stochasticity in biology, emphasizing stochastic processes at molecular, cellular, and higher levels in organisms as having a role beyond simple noise. The CDP and Noble's theories (NT) claim that biological systems use stochasticity. This paper presents the CDP and NT, discussing common notions and differences between the two theories. The paper presents the CDP-based concept of taking the disorder beyond its role in nature to correct malfunctions of systems and improve the efficiency of biological systems. The use of CDP-based algorithms embedded in second-generation artificial intelligence platforms is described. In summary, noise is inherent to complex systems and has a functional role. The CDP provides the option of using noise to improve functionality.
Affiliation(s)
- Yaron Ilan
  - Faculty of Medicine, Hebrew University, Department of Medicine, Hadassah Medical Center, Jerusalem, Israel
9. D’Iorio M, Dewar K. Replication-associated inversions are the dominant form of bacterial chromosome structural variation. Life Sci Alliance 2022; 6:e202201434. [PMID: 36261227 PMCID: PMC9584773 DOI: 10.26508/lsa.202201434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 09/29/2022] [Accepted: 09/30/2022] [Indexed: 11/24/2022]
Abstract
The structural arrangements of bacterial chromosomes vary widely between closely related species and can result in significant phenotypic outcomes. The appearance of large-scale chromosomal inversions that are symmetric relative to markers for the origin of replication (OriC) has been previously observed; however, the overall prevalence of replication-associated structural rearrangements (RASRs) in bacteria and their causal mechanisms are currently unknown. Here, we systematically identify the locations of RASRs in species with multiple complete-sequenced genomes and investigate potential mediating biological mechanisms. We found that 247 of 313 species contained sequences with at least one large (>50 Kb) inversion in their sequence comparisons, and the aggregated inversion distances away from symmetry were normally distributed with a mean of zero. Many inversions that were offset from dnaA were found to be centered on a different marker for the OriC. Instances of flanking repeats provide evidence that breaks formed during the replication process could be repaired to opposing positions. We also found a strong relationship between the later stages of replication and the range in distance variation from symmetry.
Affiliation(s)
- Matthew D’Iorio
  - Quantitative Life Sciences, McGill University, Montreal, Canada
- Ken Dewar
  - Department of Human Genetics, McGill University, Montreal, Canada
  - Centre for Microbiome Research, McGill University, Montreal, Canada
10. Di Stefano L. All Quiet on the TE Front? The Role of Chromatin in Transposable Element Silencing. Cells 2022; 11:2501. [PMID: 36010577 PMCID: PMC9406493 DOI: 10.3390/cells11162501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Revised: 07/27/2022] [Accepted: 08/03/2022] [Indexed: 01/09/2023]
Abstract
Transposable elements (TEs) are mobile genetic elements that constitute a sizeable portion of many eukaryotic genomes. Through their mobility, they represent a major source of genetic variation, and their activation can cause genetic instability and has been linked to aging, cancer and neurodegenerative diseases. Accordingly, tight regulation of TE transcription is necessary for normal development. Chromatin is at the heart of TE regulation; however, we still lack a comprehensive understanding of the precise role of chromatin marks in TE silencing and how chromatin marks are established and maintained at TE loci. In this review, I discuss evidence documenting the contribution of chromatin-associated proteins and histone marks in TE regulation across different species with an emphasis on Drosophila and mammalian systems.
Affiliation(s)
- Luisa Di Stefano
  - Molecular, Cellular and Developmental Biology Department (MCD), Centre de Biologie Intégrative (CBI), University of Toulouse, CNRS, UPS, 31062 Toulouse, France
11. Feng S, Opit G, Deng W, Stejskal V, Li Z. A chromosome-level genome of the booklouse, Liposcelis brunnea, provides insight into louse evolution and environmental stress adaptation. Gigascience 2022; 11:giac062. [PMID: 35852419 PMCID: PMC9295366 DOI: 10.1093/gigascience/giac062] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/29/2022] [Revised: 05/03/2022] [Accepted: 05/30/2022] [Indexed: 11/15/2022]
Abstract
BACKGROUND Booklice (psocids) in the genus Liposcelis (Psocoptera: Liposcelididae) are a group of important storage pests, found in libraries, grain storages, and food-processing facilities. Booklice are able to survive under heat treatment and typically possess high resistance to common fumigant insecticides, hence posing a threat to storage security worldwide. RESULTS We assembled the genome of the booklouse, L. brunnea, the first genome reported in Psocoptera, using PacBio long-read sequencing, Illumina sequencing, and chromatin conformation capture (Hi-C) methods. After assembly, polishing, haplotype purging, and Hi-C scaffolding, we obtained 9 linkage groups (174.1 Mb in total) ranging from 12.1 Mb to 27.6 Mb (N50: 19.7 Mb), with the BUSCO completeness at 98.9%. In total, 15,543 genes were predicted by the Maker pipeline. Gene family analyses indicated the sensing-related gene families (OBP and OR) and the resistance-related gene families (ABC, EST, GST, UGT, and P450) expanded significantly in L. brunnea compared with those of their closest relatives (2 parasitic lice). Based on transcriptomic analysis, we found that the CYP4 subfamily from the P450 gene family functioned during phosphine fumigation; HSP genes, particularly those from the HSP70 subfamily, were upregulated significantly under high temperatures. CONCLUSIONS We present a chromosome-level genome assembly of L. brunnea, the first genome reported for the order Psocoptera. Our analyses provide new insights into the gene family evolution of the louse clade and the transcriptomic responses of booklice to environmental stresses.
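The N50 figure quoted in this abstract is a standard assembly contiguity statistic: the largest length L such that sequences of length L or more account for at least half of the total assembly size. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy example with made-up lengths (not the L. brunnea linkage groups).
print(n50([2, 3, 4, 5, 6]))
```

NGA50, reported for GoldRush elsewhere in this list, applies the same idea to aligned blocks measured against the reference genome size rather than the assembly size.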
Affiliation(s)
- Shiqian Feng
  - Department of Plant Biosecurity, College of Plant Protection, China Agricultural University, Beijing 100193, China
  - Key Laboratory of Surveillance and Management for Plant Quarantine Pests, Ministry of Agriculture and Rural Affairs, Beijing 100193, China
- George Opit
  - Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, Oklahoma 74078, USA
- Wenxin Deng
  - Department of Plant Biosecurity, College of Plant Protection, China Agricultural University, Beijing 100193, China
  - Key Laboratory of Surveillance and Management for Plant Quarantine Pests, Ministry of Agriculture and Rural Affairs, Beijing 100193, China
- Vaclav Stejskal
  - Crop Research Institute, Drnovská 507, 161 06 Prague 6, Czech Republic
  - Faculty of Agrobiology, Food and Natural Resources, Czech University of Life Sciences, Kamycka 129, 165 00 Prague, Czech Republic
- Zhihong Li
  - Department of Plant Biosecurity, College of Plant Protection, China Agricultural University, Beijing 100193, China
  - Key Laboratory of Surveillance and Management for Plant Quarantine Pests, Ministry of Agriculture and Rural Affairs, Beijing 100193, China
12. Yasir M, Turner AK, Lott M, Rudder S, Baker D, Bastkowski S, Page AJ, Webber MA, Charles IG. Long-read sequencing for identification of insertion sites in large transposon mutant libraries. Sci Rep 2022; 12:3546. [PMID: 35241765 PMCID: PMC8894413 DOI: 10.1038/s41598-022-07557-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/04/2021] [Accepted: 02/14/2022] [Indexed: 11/09/2022]
Abstract
Transposon insertion site sequencing (TIS) is a powerful method for associating genotype to phenotype. However, all TIS methods described to date use short nucleotide sequence reads which cannot uniquely determine the locations of transposon insertions within repeating genomic sequences where the repeat units are longer than the sequence read length. To overcome this limitation, we have developed a TIS method using Oxford Nanopore sequencing technology that generates and uses long nucleotide sequence reads; we have called this method LoRTIS (Long-Read Transposon Insertion-site Sequencing). LoRTIS enabled the unique localisation of transposon insertion sites within long repetitive genetic elements of E. coli, such as the transposase genes of insertion sequences and copies of the ~5 kb ribosomal RNA operon. We demonstrate that LoRTIS is reproducible, gives comparable results to short-read TIS methods for essential genes, and better resolution around repeat elements. The Oxford Nanopore sequencing device that we used is cost-effective, small and easily portable. Thus, LoRTIS is an efficient means of uniquely identifying transposon insertion sites within long repetitive genetic elements and can be easily transported to, and used in, laboratories that lack access to expensive DNA sequencing facilities.
Affiliation(s)
- Muhammad Yasir
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- A Keith Turner
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- Martin Lott
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- Steven Rudder
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- David Baker
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- Sarah Bastkowski
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- Andrew J Page
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
- Mark A Webber
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
  - Norwich Medical School, Norwich Research Park, Colney Lane, Norwich, NR4 7TJ, UK
  - University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
- Ian G Charles
  - Quadram Institute Bioscience, Rosalind Franklin Road, Norwich, NR4 7UQ, UK
  - Norwich Medical School, Norwich Research Park, Colney Lane, Norwich, NR4 7TJ, UK
  - University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
13
|
Cunial F, Denas O, Belazzougui D. Fast and compact matching statistics analytics. Bioinformatics 2022; 38:1838-1845. [PMID: 35134833 PMCID: PMC9665870 DOI: 10.1093/bioinformatics/btac064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2021] [Revised: 01/08/2022] [Accepted: 01/31/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. RESULTS We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. AVAILABILITY AND IMPLEMENTATION Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. 
The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
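For readers unfamiliar with the term, the following is a minimal Python sketch of what a matching statistics array is, under the usual definition: MS[i] is the length of the longest prefix of query[i:] that occurs somewhere in the text. It is a naive illustration only, not the parallel, index-based algorithm of indexed_ms; the function name `matching_statistics` and the example strings are hypothetical.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """Naive matching statistics of `query` with respect to `text`.

    MS[i] is the length of the longest prefix of query[i:] that occurs
    as a substring of text. Illustrative only: substring search is used
    in place of the suffix-tree-style indexes that real tools rely on.
    """
    ms = []
    length = 0
    for i in range(len(query)):
        # Dropping the first character of the previous match leaves a
        # string that still occurs in text, so MS[i] >= MS[i-1] - 1.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    ms = matching_statistics("cabx", "abcabd")
    print(ms)
    # A range-maximum query over MS, as mentioned in the abstract,
    # reduces here to a plain max over a slice.
    print(max(ms[1:3]))
```

Long runs of large values in the array mark regions of the query that are shared with the text, which is why range-maximum and range-sum queries over it are useful for exploring local similarity.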
Affiliation(s)
- Fabio Cunial
- Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG and CSBD), Dresden 01307, Germany. To whom correspondence should be addressed.
- Djamal Belazzougui
- CAPA, DTISI, Centre de Recherche sur l’Information Scientifique et Technique, Algiers, Algeria
|
14
|
Garikipati VNS, Uchida S. Elucidating the Functions of Non-Coding RNAs from the Perspective of RNA Modifications. Noncoding RNA 2021; 7:ncrna7020031. [PMID: 34065036 PMCID: PMC8163165 DOI: 10.3390/ncrna7020031] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2021] [Revised: 04/29/2021] [Accepted: 05/05/2021] [Indexed: 12/11/2022]
Abstract
It is now commonly accepted that most of the mammalian genome is transcribed as RNA, yet less than 2% of such RNA encodes proteins. A majority of transcribed RNA exists as non-protein-coding RNAs (ncRNAs) with various functions. Because of the lack of sequence homology among most ncRNA species, it is difficult to infer the potential functions of ncRNAs by examining sequence patterns, such as catalytic domains, as in the case of proteins. Adding to the existing complexity of predicting the functions of the ever-growing number of ncRNAs, increasing evidence suggests that various enzymes modify ncRNAs (e.g., ADARs, METTL3, and METTL14), which has opened up a new field of study called epitranscriptomics. Here, we examine the current status of ncRNA research from the perspective of epitranscriptomics.
Affiliation(s)
- Venkata Naga Srikanth Garikipati
- Department of Emergency Medicine, The Ohio State University Wexner Medical Center, Columbus, OH 43210, USA
- Dorothy M. Davis Heart Lung and Research Institute, The Ohio State University Wexner Medical Center, Columbus, OH 43210, USA
- Shizuka Uchida
- Center for RNA Medicine, Department of Clinical Medicine, Aalborg University, Frederikskaj 10B, 2. (building C), DK-2450 Copenhagen SV, Denmark
- Correspondence: or
|
15
|
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
16
|
Sadeq S, Al-Hashimi S, Cusack CM, Werner A. Endogenous Double-Stranded RNA. Noncoding RNA 2021; 7:15. [PMID: 33669629 PMCID: PMC7930956 DOI: 10.3390/ncrna7010015] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 01/31/2021] [Revised: 02/15/2021] [Accepted: 02/17/2021] [Indexed: 02/07/2023]
Abstract
The birth of long non-coding RNAs (lncRNAs) is closely associated with the presence and activation of repetitive elements in the genome. The transcription of endogenous retroviruses as well as long and short interspersed elements is not only essential for evolving lncRNAs but is also a significant source of double-stranded RNA (dsRNA). From an lncRNA-centric point of view, the latter is a minor source of bother in the context of the entire cell; however, dsRNA is an essential threat. A viral infection is associated with cytoplasmic dsRNA, and endogenous RNA hybrids only differ from viral dsRNA by the 5' cap structure. Hence, a multi-layered defense network is in place to protect cells from viral infections but tolerates endogenous dsRNA structures. A first line of defense is established with compartmentalization; whereas endogenous dsRNA is found predominantly confined to the nucleus and the mitochondria, exogenous dsRNA reaches the cytoplasm. Here, various sensor proteins recognize features of dsRNA including the 5' phosphate group of viral RNAs or hybrids with a particular length but not specific nucleotide sequences. The sensors trigger cellular stress pathways and innate immunity via interferon signaling but also induce apoptosis via caspase activation. Because of its central role in viral recognition and immune activation, dsRNA sensing is implicated in autoimmune diseases and used to treat cancer.
Affiliation(s)
- Andreas Werner
- Biosciences Institute, Medical School, Newcastle University, Newcastle upon Tyne NE2 4HH, UK; (S.S.); (S.A.-H.); (C.M.C.)
|
17
|
Paredes-Céspedes DM, Rojas-García AE, Medina-Díaz IM, Ramos KS, Herrera-Moreno JF, Barrón-Vivanco BS, González-Arias CA, Bernal-Hernández YY. Environmental and socio-cultural impacts on global DNA methylation in the indigenous Huichol population of Nayarit, Mexico. Environmental Science and Pollution Research International 2021; 28:4472-4487. [PMID: 32940839 DOI: 10.1007/s11356-020-10804-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/01/2019] [Accepted: 09/09/2020] [Indexed: 06/11/2023]
Abstract
Alterations of global DNA methylation have been evaluated in several studies worldwide; however, Long Interspersed Nuclear Elements-1 (LINE-1) methylation in genetically conserved populations such as indigenous communities has not, to our knowledge, been reported. The aim of this study was to evaluate the relationship between LINE-1 methylation patterns and factors such as pesticide exposure and socio-cultural characteristics in the Indigenous Huichol Population of Nayarit, Mexico. A cross-sectional study was conducted in 140 Huichol indigenous individuals. A structured questionnaire was used to determine general and anthropometric characteristics, diet, harmful habits, and pesticide exposure. DNA methylation was determined by pyrosequencing of bisulfite-treated DNA. A lower level of LINE-1 methylation was found in the indigenous population when compared to a Mestizo population previously studied by our group. This difference might be due to the influence of the genetic admixture and differing dietary and lifestyle habits. The males in the indigenous population exhibited increased LINE-1 methylation in comparison to the females. Sex and alcohol consumption showed positive associations with LINE-1 methylation, while weight, current work in the field, current pesticide usage, and folate intake exhibited negative associations with LINE-1 methylation. The results suggest that ethnicity, as well as other internal and environmental factors, might influence LINE-1 methylation.
Affiliation(s)
- Diana Marcela Paredes-Céspedes
- Posgrado en Ciencias Biológico Agropecuarias, Unidad Académica de Agricultura, Km. 9 Carretera Tepic-Compostela, Xalisco, Nayarit, México
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Aurora Elizabeth Rojas-García
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Irma Martha Medina-Díaz
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Kenneth S Ramos
- Institute of Biosciences and Technology, Texas A&M University Health Science Center, 121 W. Holcombe Blvd, Houston, TX 77030, USA
- José Francisco Herrera-Moreno
- Posgrado en Ciencias Biológico Agropecuarias, Unidad Académica de Agricultura, Km. 9 Carretera Tepic-Compostela, Xalisco, Nayarit, México
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Briscia Socorro Barrón-Vivanco
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Cyndia Azucena González-Arias
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
- Yael Yvette Bernal-Hernández
- Laboratorio de Contaminación y Toxicología Ambiental, Secretaría de Investigación y Posgrado, Universidad Autónoma de Nayarit, Ciudad de la Cultura s/n. C.P, 6300, Tepic, Nayarit, México
|
18
|
An emerging role of chromatin-interacting RNA-binding proteins in transcription regulation. Essays Biochem 2020; 64:907-918. [PMID: 33034346 DOI: 10.1042/ebc20200004] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/23/2020] [Revised: 09/08/2020] [Accepted: 09/15/2020] [Indexed: 01/01/2023]
Abstract
Transcription factors (TFs) are well-established key factors orchestrating gene transcription, and RNA-binding proteins (RBPs) are mainly thought to participate in post-transcriptional control of gene expression. In fact, these two steps are functionally coupled, offering a possibility for reciprocal communication between transcription and regulatory RNAs and RBPs. Recently, a series of exploratory studies, utilizing functional genomic strategies, have revealed that RBPs are prevalently involved in transcription control genome-wide through their interactions with chromatin. Here, we present a refined census of RBPs to explore this emerging role and discuss the global view of RBP-chromatin interactions and their functional diversity in transcription regulation.
|
19
|
Liu Q, Garcia M, Wang S, Chen CW. Therapeutic Target Discovery Using High-Throughput Genetic Screens in Acute Myeloid Leukemia. Cells 2020; 9:cells9081888. [PMID: 32806592 PMCID: PMC7465943 DOI: 10.3390/cells9081888] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/30/2020] [Revised: 08/09/2020] [Accepted: 08/10/2020] [Indexed: 12/20/2022]
Abstract
The development of high-throughput gene manipulating tools such as short hairpin RNA (shRNA) and CRISPR/Cas9 libraries has enabled robust characterization of novel functional genes contributing to the pathological states of diseases. In acute myeloid leukemia (AML), these genetic screen approaches have been used to identify effector genes with previously unknown roles in AML. These AML-related genes cluster in the cellular pathways mediating epigenetics, signal transduction, transcriptional regulation, and energy metabolism. The shRNA/CRISPR genetic screens also revealed an array of candidate genes amenable to pharmaceutical targeting. This review aims to summarize genes, mechanisms, and potential therapeutic strategies found via high-throughput genetic screens in AML. We also discuss the potential of these findings to instruct novel AML therapies for combating drug resistance in this genetically heterogeneous disease.
Affiliation(s)
- Qiao Liu
- Fujian Provincial Key Laboratory on Hematology, Department of Hematology, Fujian Institute of Hematology, Fujian Medical University Union Hospital, Fuzhou 350108, China; (Q.L.); (S.W.)
- Union Clinical Medical College, Fujian Medical University, Fuzhou 350108, China
- Department of Systems Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
- Michelle Garcia
- Department of Systems Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
- Pomona College, Claremont, CA 91711, USA
- Shaoyuan Wang
- Fujian Provincial Key Laboratory on Hematology, Department of Hematology, Fujian Institute of Hematology, Fujian Medical University Union Hospital, Fuzhou 350108, China; (Q.L.); (S.W.)
- Union Clinical Medical College, Fujian Medical University, Fuzhou 350108, China
- Chun-Wei Chen
- Department of Systems Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
- Correspondence:
|
20
|
Fan R, Gu Z, Guang X, Marín JC, Varas V, González BA, Wheeler JC, Hu Y, Li E, Sun X, Yang X, Zhang C, Gao W, He J, Munch K, Corbett-Detig R, Barbato M, Pan S, Zhan X, Bruford MW, Dong C. Genomic analysis of the domestication and post-Spanish conquest evolution of the llama and alpaca. Genome Biol 2020; 21:159. [PMID: 32616020 PMCID: PMC7331169 DOI: 10.1186/s13059-020-02080-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 10/08/2019] [Accepted: 06/21/2020] [Indexed: 12/28/2022]
Abstract
BACKGROUND Despite their regional economic importance and being increasingly reared globally, the origins and evolution of the llama and alpaca remain poorly understood. Here we report reference genomes for the llama, and for the guanaco and vicuña (their putative wild progenitors), compare these with the published alpaca genome, and resequence seven individuals of all four species to better understand domestication and introgression between the llama and alpaca. RESULTS Phylogenomic analysis confirms that the llama was domesticated from the guanaco and the alpaca from the vicuña. Introgression was much higher in the alpaca genome (36%) than the llama (5%) and could be dated close to the time of the Spanish conquest, approximately 500 years ago. Introgression patterns are at their most variable on the X-chromosome of the alpaca, featuring 53 genes known to have deleterious X-linked phenotypes in humans. Strong genome-wide introgression signatures include olfactory receptor complexes into both species, hypertension resistance into alpaca, and fleece/fiber traits into llama. Genomic signatures of domestication in the llama include male reproductive traits, while those in the alpaca include fleece characteristics and olfaction-related and hypoxia adaptation traits. Expression analysis of the introgressed region that is syntenic to human HSA4q21, a gene cluster previously associated with hypertension in humans under hypoxic conditions, shows a previously undocumented role for PRDM8 downregulation as a potential transcriptional regulation mechanism, analogous to that previously reported at high altitude for hypoxia-inducible factor 1α. CONCLUSIONS The unprecedented introgression signatures within both domestic camelid genomes may reflect post-conquest changes in agriculture and the breakdown of traditional management practices.
Affiliation(s)
- Ruiwen Fan
- College of Animal Science and Veterinary Medicine, Shanxi Agricultural University, Taigu, Shanxi, China
- Zhongru Gu
- CAS Key Lab of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Cardiff University – Institute of Zoology Joint Laboratory for Biocomplexity Research, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Juan Carlos Marín
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad del Bio Bio, Chillán, Chile
- Valeria Varas
- Programa de Doctorado en Ciencias mención Ecología y Evolución, Escuela de Graduados, Facultad de Ciencias, Universidad Austral de Chile, Valdivia, Chile
- Benito A. González
- Facultad de Ciencias Forestales y de la Conservación de la Naturaleza, Universidad de Chile, Santiago, Chile
- Jane C. Wheeler
- CONOPA-Instituto de Investigación y Desarrollo de Camélidos Sudamericanos, Pachacamac, Lima, Peru
- Yafei Hu
- BGI Genomics, BGI, Shenzhen, China
- Erli Li
- BGI Genomics, BGI, Shenzhen, China
- Wenjun Gao
- College of Animal Science and Veterinary Medicine, Shanxi Agricultural University, Taigu, Shanxi, China
- Junping He
- College of Animal Science and Veterinary Medicine, Shanxi Agricultural University, Taigu, Shanxi, China
- Kasper Munch
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Russel Corbett-Detig
- Department of Biomolecular Engineering and Genomics Institute, UC Santa Cruz, Santa Cruz, CA, USA
- Mario Barbato
- Department of Animal Science, Food and Technology – DIANA, Università Cattolica del Sacro Cuore, Piacenza, Italy
- Shengkai Pan
- CAS Key Lab of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Cardiff University – Institute of Zoology Joint Laboratory for Biocomplexity Research, Chinese Academy of Sciences, Beijing, China
- Xiangjiang Zhan
- CAS Key Lab of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Cardiff University – Institute of Zoology Joint Laboratory for Biocomplexity Research, Chinese Academy of Sciences, Beijing, China
- Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
- Michael W. Bruford
- Cardiff University – Institute of Zoology Joint Laboratory for Biocomplexity Research, Chinese Academy of Sciences, Beijing, China
- School of Biosciences and Sustainable Places Institute, Cardiff University, Cardiff, Wales, UK
- Changsheng Dong
- College of Animal Science and Veterinary Medicine, Shanxi Agricultural University, Taigu, Shanxi, China
|
21
|
Corless S, Höcker S, Erhardt S. Centromeric RNA and Its Function at and Beyond Centromeric Chromatin. J Mol Biol 2020; 432:4257-4269. [DOI: 10.1016/j.jmb.2020.03.027] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/30/2019] [Revised: 03/26/2020] [Accepted: 03/27/2020] [Indexed: 12/21/2022]
|
22
|
Pirogov A, Pfaffelhuber P, Börsch-Haubold A, Haubold B. High-complexity regions in mammalian genomes are enriched for developmental genes. Bioinformatics 2020; 35:1813-1819. [PMID: 30395202 PMCID: PMC6546125 DOI: 10.1093/bioinformatics/bty922] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/23/2018] [Revised: 09/17/2018] [Accepted: 11/02/2018] [Indexed: 12/01/2022]
Abstract
Motivation Unique sequence regions are associated with genetic function in vertebrate genomes. However, measuring uniqueness, or absence of long repeats, along a genome is conceptually and computationally difficult. Here we use a variant of the Lempel-Ziv complexity, the match complexity, Cm, and augment it by deriving its null distribution for random sequences. We then apply Cm to the human and mouse genomes to investigate the relationship between sequence complexity and function. Results We implemented Cm in the program macle and show through simulation that the newly derived null distribution of Cm is accurate. This allows us to delineate high-complexity regions in the human and mouse genomes. Using our program macle2go, we find that these regions are twofold enriched for genes. Moreover, the genes contained in these regions are more than 10-fold enriched for developmental functions. Availability and implementation Source code for macle and macle2go is available from www.github.com/evolbioinf/macle and www.github.com/evolbioinf/macle2go, respectively; Cm browser tracks from guanine.evolbio.mgp.de/complexity. Supplementary information Supplementary data are available at Bioinformatics online.
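As background on the Lempel-Ziv idea the abstract builds on, here is a minimal Python sketch that counts Lempel-Ziv-style factors in a sequence: each factor is the longest string starting at the current position that also starts at an earlier position, extended by one further character. This is a simplified illustration and is explicitly not the match complexity Cm implemented in macle, whose exact definition and null distribution are not given in the abstract; the function name `lz_factor_count` is hypothetical.

```python
def lz_factor_count(seq: str) -> int:
    """Count Lempel-Ziv-style factors of `seq` (naive quadratic sketch).

    Repetitive sequences decompose into few long factors, while
    sequences without long repeats decompose into many short ones.
    """
    n = len(seq)
    i = 0
    count = 0
    while i < n:
        length = 0
        # Extend while seq[i:i+length+1] also starts at some position < i.
        # Limiting the search window to seq[0:i+length] enforces that the
        # earlier copy starts before i; overlap with the factor is allowed.
        while i + length < n and seq.find(seq[i:i + length + 1], 0, i + length) != -1:
            length += 1
        # The factor is the repeated part plus one new character.
        i += length + 1
        count += 1
    return count


if __name__ == "__main__":
    for s in ("acgt", "acgtacgt", "aaaa"):
        print(s, lz_factor_count(s))
```

Dividing the factor count by the sequence length gives a rough per-base complexity; a statistic of this kind only becomes interpretable once compared against its expectation for random sequences, which is the role of the null distribution derived in the paper.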
Affiliation(s)
- Anton Pirogov
- Lehrstuhl für Informatik, RWTH Aachen University, Max-Planck-Institute for Evolutionary Biology, Plön, Germany; Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany
|
23
|
Abstract
Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled - one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of six long-read assemblers (Canu, Flye, Miniasm/Minipolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v1.9 produced moderately reliable assemblies but had the longest runtimes of all assemblers tested. Flye v2.6 was more reliable and did particularly well with plasmid assembly. Miniasm/Minipolish v0.3 was the only assembler which consistently produced clean contig circularisation. Raven v0.0.5 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.3.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
Affiliation(s)
- Ryan R. Wick
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Kathryn E. Holt
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, WC1E 7HT, UK
|
24
|
Abstract
Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled - one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of seven long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v1.9 produced moderately reliable assemblies but had the longest runtimes of all assemblers tested. Flye v2.7 was more reliable and did particularly well with plasmid assembly. Miniasm/Minipolish v0.3 and NECAT v20200119 were the most likely to produce clean contig circularisation. Raven v0.0.8 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.4.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
Affiliation(s)
- Ryan R. Wick
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Kathryn E. Holt
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, WC1E 7HT, UK
|
25
|
Abstract
Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled - one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of eight long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, NextDenovo/NextPolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v2.0 produced reliable assemblies and was good with plasmids, but it performed poorly with circularisation and had the longest runtimes of all assemblers tested. Flye v2.8 was also reliable and made the smallest sequence errors, though it used the most RAM. Miniasm/Minipolish v0.3/v0.1.3 was the most likely to produce clean contig circularisation. NECAT v20200119 was reliable and good at circularisation but tended to make larger sequence errors. NextDenovo/NextPolish v2.3.0/v1.2.4 was reliable with chromosome assembly but bad with plasmid assembly. Raven v1.1.10 was the most reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.5.1 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
Affiliation(s)
- Ryan R. Wick
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Kathryn E. Holt
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, WC1E 7HT, UK
|
26
|
Abstract
Background: Data sets from long-read sequencing platforms (Oxford Nanopore Technologies and Pacific Biosciences) allow for most prokaryote genomes to be completely assembled - one contig per chromosome or plasmid. However, the high per-read error rate of long-read sequencing necessitates different approaches to assembly than those used for short-read sequencing. Multiple assembly tools (assemblers) exist, which use a variety of algorithms for long-read assembly. Methods: We used 500 simulated read sets and 120 real read sets to assess the performance of eight long-read assemblers (Canu, Flye, Miniasm/Minipolish, NECAT, NextDenovo/NextPolish, Raven, Redbean and Shasta) across a wide variety of genomes and read parameters. Assemblies were assessed on their structural accuracy/completeness, sequence identity, contig circularisation and computational resources used. Results: Canu v2.1 produced reliable assemblies and was good with plasmids, but it performed poorly with circularisation and had the longest runtimes of all assemblers tested. Flye v2.8 was also reliable and made the smallest sequence errors, though it used the most RAM. Miniasm/Minipolish v0.3/v0.1.3 was the most likely to produce clean contig circularisation. NECAT v20200803 was reliable and good at circularisation but tended to make larger sequence errors. NextDenovo/NextPolish v2.3.1/v1.3.1 was reliable with chromosome assembly but bad with plasmid assembly. Raven v1.3.0 was reliable for chromosome assembly, though it did not perform well on small plasmids and had circularisation issues. Redbean v2.5 and Shasta v0.7.0 were computationally efficient but more likely to produce incomplete assemblies. Conclusions: Of the assemblers tested, Flye, Miniasm/Minipolish, NextDenovo/NextPolish and Raven performed best overall. However, no single tool performed well on all metrics, highlighting the need for continued development on long-read assembly algorithms.
Affiliation(s)
- Ryan R. Wick
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Kathryn E. Holt
- Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, VIC, 3004, Australia
- Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, WC1E 7HT, UK
|
27
|
Goldstein S, Beka L, Graf J, Klassen JL. Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing. BMC Genomics 2019; 20:23. [PMID: 30626323 PMCID: PMC6325685 DOI: 10.1186/s12864-018-5381-7] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 07/12/2018] [Accepted: 12/16/2018] [Indexed: 11/23/2022]
Abstract
Background Short-read sequencing technologies have made microbial genome sequencing cheap and accessible. However, closing genomes is often costly and assembling short reads from genomes that are repetitive and/or have extreme %GC content remains challenging. Long-read, single-molecule sequencing technologies such as the Oxford Nanopore MinION have the potential to overcome these difficulties, although the best approach for harnessing their potential remains poorly evaluated. Results We sequenced nine bacterial genomes spanning a wide range of GC contents using Illumina MiSeq and Oxford Nanopore MinION sequencing technologies to determine the advantages of each approach, both individually and combined. Assemblies using only MiSeq reads were highly accurate but lacked contiguity, a deficiency that was partially overcome by adding MinION reads to these assemblies. Even more contiguous genome assemblies were generated by using MinION reads for initial assembly, but these assemblies were more error-prone and required further polishing. This was especially pronounced when Illumina libraries were biased, as was the case for our strains with both high and low GC content. Increased genome contiguity dramatically improved the annotation of insertion sequences and secondary metabolite biosynthetic gene clusters, likely because long-reads can disambiguate these highly repetitive but biologically important genomic regions. Conclusions Genome assembly using short-reads is challenged by repetitive sequences and extreme GC contents. Our results indicate that these difficulties can be largely overcome by using single-molecule, long-read sequencing technologies such as the Oxford Nanopore MinION. 
Using MinION reads for assembly followed by polishing with Illumina reads generated the most contiguous genomes with sufficient accuracy to enable the accurate annotation of important but difficult to sequence genomic features such as insertion sequences and secondary metabolite biosynthetic gene clusters. The combination of Oxford Nanopore and Illumina sequencing can therefore cost-effectively advance studies of microbial evolution and genome-driven drug discovery. Electronic supplementary material The online version of this article (10.1186/s12864-018-5381-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sarah Goldstein
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Lidia Beka
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Joerg Graf
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Jonathan L Klassen
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
28
Pisupati R, Vergara D, Kane NC. Diversity and evolution of the repetitive genomic content in Cannabis sativa. BMC Genomics 2018; 19:156. [PMID: 29466945 PMCID: PMC5822635 DOI: 10.1186/s12864-018-4494-3] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 03/30/2017] [Accepted: 01/24/2018] [Indexed: 01/13/2023]
Abstract
Background The repetitive content of the genome, once considered to be “junk DNA”, is in fact an essential component of genomic architecture and evolution. In this study, we used the genomes of three varieties of Cannabis sativa, three varieties of Humulus lupulus and one genotype of Morus notabilis to explore their repetitive content using a graph-based clustering method, designed to explore and compare repeat content in genomes that have not been fully assembled. Results The repetitive content in the C. sativa genome is mainly composed of the retrotransposons LTR/Copia and LTR/Gypsy (14% and 14.8%, respectively), ribosomal DNA (2%), and low-complexity sequences (29%). We observed a recent copy number expansion in some transposable element families. Simple repeats and low complexity regions of the genome show higher intra and inter species variation. Conclusions As with other sequenced genomes, the repetitive content of C. sativa’s genome exhibits a wide range of evolutionary patterns. Some repeat types have patterns of diversity consistent with expansions followed by losses in copy number, while others may have expanded more slowly and reached a steady state. Still, other repetitive sequences, particularly ribosomal DNA (rDNA), show signs of concerted evolution playing a major role in homogenizing sequence variation. Electronic supplementary material The online version of this article (10.1186/s12864-018-4494-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Rahul Pisupati
- Department of Biotechnology, Indian Institute of Technology, Kharagpur, 721302, India; Present address: Gregor Mendel Institute, Dr. Bohr-gasse 3, Vienna, 1030, Austria
- Daniela Vergara
- Ecology and Evolutionary Biology, University of Colorado, Boulder, 80302, USA
- Nolan C Kane
- Ecology and Evolutionary Biology, University of Colorado, Boulder, 80302, USA
29
|
Morgenstern B, Schöbel S, Leimeister CA. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings. Algorithms Mol Biol 2017; 12:27. [PMID: 29238399 PMCID: PMC5724348 DOI: 10.1186/s13015-017-0118-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 10/04/2017] [Accepted: 11/28/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. RESULTS In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
Affiliation(s)
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Svenja Schöbel
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
30
|
C L B, S Nair A. Benchmark Dataset for Whole Genome Sequence Compression. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1228-1236. [PMID: 27214907 DOI: 10.1109/tcbb.2016.2568186] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
The research in DNA data compression lacks a standard dataset for testing compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole-genome sequence dataset, and proposes a benchmark dataset built using a multistage sampling procedure. Considering the genome sequences of organisms available at the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses are evident only with a comparison based on the scientifically compiled benchmark dataset. AVAILABILITY The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
31
Cuellar TL, Herzner AM, Zhang X, Goyal Y, Watanabe C, Friedman BA, Janakiraman V, Durinck S, Stinson J, Arnott D, Cheung TK, Chaudhuri S, Modrusan Z, Doerr JM, Classon M, Haley B. Silencing of retrotransposons by SETDB1 inhibits the interferon response in acute myeloid leukemia. J Cell Biol 2017; 216:3535-3549. [PMID: 28887438 PMCID: PMC5674883 DOI: 10.1083/jcb.201612160] [Citation(s) in RCA: 130] [Impact Index Per Article: 18.6] [Received: 12/21/2016] [Revised: 05/15/2017] [Accepted: 08/03/2017] [Indexed: 01/23/2023]
Abstract
Cancer cells can rewire genetic and epigenetic regulatory networks to promote cell proliferation and evade the immune system. Using a focused CRISPR/Cas9 genetic screen, Cuellar et al. identify a novel role for the SETDB1 histone methyltransferase in regulating the antiviral response in AML cells via the suppression of transposable elements. A propensity for rewiring genetic and epigenetic regulatory networks, thus enabling sustained cell proliferation, suppression of apoptosis, and the ability to evade the immune system, is vital to cancer cell propagation. An increased understanding of how this is achieved is critical for identifying or improving therapeutic interventions. In this study, using acute myeloid leukemia (AML) human cell lines and a custom CRISPR/Cas9 screening platform, we identify the H3K9 methyltransferase SETDB1 as a novel, negative regulator of innate immunity. SETDB1 is overexpressed in many cancers, and loss of this gene in AML cells triggers desilencing of retrotransposable elements that leads to the production of double-stranded RNAs (dsRNAs). This is coincident with induction of a type I interferon response and apoptosis through the dsRNA-sensing pathway. Collectively, our findings establish a unique gene regulatory axis that cancer cells can exploit to circumvent the immune system.
Affiliation(s)
- Trinna L Cuellar
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Xiaotian Zhang
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Yogesh Goyal
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Colin Watanabe
- Department of Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, CA
- Brad A Friedman
- Department of Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, CA
- Steffen Durinck
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA; Department of Bioinformatics and Computational Biology, Genentech, Inc., South San Francisco, CA
- Jeremy Stinson
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- David Arnott
- Department of Protein Chemistry, Genentech, Inc., South San Francisco, CA
- Tommy K Cheung
- Department of Protein Chemistry, Genentech, Inc., South San Francisco, CA
- Subhra Chaudhuri
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Zora Modrusan
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Jonas Martin Doerr
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
- Marie Classon
- Department of Discovery Oncology, Genentech, Inc., South San Francisco, CA
- Benjamin Haley
- Department of Molecular Biology, Genentech, Inc., South San Francisco, CA
32
|
ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing. Sci Rep 2017; 7:44437. [PMID: 28300202 PMCID: PMC5353602 DOI: 10.1038/srep44437] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/18/2016] [Accepted: 02/08/2017] [Indexed: 11/21/2022]
Abstract
Isolating and sequencing specific regions in a genome is a cornerstone of molecular biology. This has been facilitated by computationally encoding the thermodynamics of DNA hybridization for automated design of hybridization and priming oligonucleotides. However, the repetitive composition of genomes challenges the identification of target-specific oligonucleotides, which limits genetics and genomics research on many species. Here, a tool called ThermoAlign was developed that ensures the design of target-specific primer pairs for DNA amplification. This is achieved by evaluating the thermodynamics of hybridization for full-length oligonucleotide-template alignments — thermoalignments — across the genome to identify primers predicted to bind specifically to the target site. For amplification-based resequencing of regions that cannot be amplified by a single primer pair, a directed graph analysis method is used to identify minimum amplicon tiling paths. Laboratory validation by standard and long-range polymerase chain reaction and amplicon resequencing with maize, one of the most repetitive genomes sequenced to date (≈85% repeat content), demonstrated the specificity-by-design functionality of ThermoAlign. ThermoAlign is released under an open source license and bundled in a dependency-free container for wide distribution. It is anticipated that this tool will facilitate multiple applications in genetics and genomics and be useful in the workflow of high-throughput targeted resequencing studies.
33
Abstract
Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of 10^3 bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies a diminishing return in converting unmappable regions/reads to become mappable with the increase of the read length, with the understanding that increasing read length will always move toward the direction of 100% mappability.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
- Jan Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY, USA
34
Pratas D, Pinho AJ, Rodrigues JMOS. XS: a FASTQ read simulator. BMC Res Notes 2014; 7:40. [PMID: 24433564 PMCID: PMC3927261 DOI: 10.1186/1756-0500-7-40] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 09/11/2013] [Accepted: 12/18/2013] [Indexed: 12/31/2022]
Abstract
Background The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. Findings We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). Conclusions XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.
Affiliation(s)
- Diogo Pratas
- Signal Processing Lab, IEETA/DETI University of Aveiro, Aveiro 3810-193, Portugal.
35
Roy RS, Chen KC, Sengupta AM, Schliep A. SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding. J Comput Biol 2012; 19:1162-75. [DOI: 10.1089/cmb.2011.0263] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/15/2022]
Affiliation(s)
- Rajat S. Roy
- Department of Computer Science, Rutgers The State University of New Jersey, Piscataway, NJ
- Kevin C. Chen
- Department of Genetics, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
- Anirvan M. Sengupta
- Department of Physics and Astronomy, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
- Alexander Schliep
- Department of Computer Science, BioMaPS Institute for Quantitative Biology, Rutgers The State University of New Jersey, Piscataway, NJ
36
Domazet-Lošo M, Haubold B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics 2011; 27:1466-72. [PMID: 21471011 DOI: 10.1093/bioinformatics/btr176] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION Bacterial and viral genomes are often affected by horizontal gene transfer observable as abrupt switching in local homology. In addition to the resulting mosaic genome structure, they frequently contain regions not found in close relatives, which may play a role in virulence mechanisms. Due to this connection to medical microbiology, there are numerous methods available to detect horizontal gene transfer. However, these are usually aimed at individual genes and viral genomes rather than the much larger bacterial genomes. Here, we propose an efficient alignment-free approach to describe the mosaic structure of viral and bacterial genomes, including their unique regions. RESULTS Our method is based on the lengths of exact matches between pairs of sequences. Long matches indicate close homology, short matches more distant homology or none at all. These exact match lengths can be looked up efficiently using an enhanced suffix array. Our program implementing this approach, alfy (ALignment-Free local homologY), efficiently and accurately detects the recombination break points in simulated DNA sequences and among recombinant HIV-1 strains. We also apply alfy to Escherichia coli genomes where we detect new evidence for the hypothesis that strains pathogenic in poultry can infect humans. AVAILABILITY alfy is written in standard C and its source code is available under the GNU General Public License from http://guanine.evolbio.mpg.de/alfy/. The software package also includes documentation and example data.
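The exact-match-length idea in the abstract above can be illustrated with a minimal sketch. This is not alfy's implementation: the abstract says match lengths are looked up with an enhanced suffix array, whereas this sketch swaps in a naive substring search for readability. The function name `match_lengths` and the toy sequences are assumptions made for illustration only.

```python
def match_lengths(query, subject):
    """For each position i in query, return the length of the longest
    prefix of query[i:] that occurs anywhere in subject.

    Naive search for illustration; a suffix array would make the
    lookups efficient on genome-sized input."""
    lengths = []
    for i in range(len(query)):
        length = 0
        # Extend the match one base at a time while it still occurs in subject.
        while i + length < len(query) and query[i:i + length + 1] in subject:
            length += 1
        lengths.append(length)
    return lengths


# Hypothetical toy sequences: they share the prefix "ACG" and then diverge.
lengths = match_lengths("ACGTTT", "ACGAAA")
print(lengths)  # [3, 2, 1, 0, 0, 0]
```

Long values indicate regions of close similarity between the two sequences, while runs of short values suggest distant homology or none, which is the signal the abstract describes.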
Affiliation(s)
- Mirjana Domazet-Lošo
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany
37
Abstract
Background High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences with valid kmers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content. Results We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”. 
Conclusions We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
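The baseline that the abstract above describes as typical (count k-mer frequencies in the reads and accept those above a threshold) can be sketched as follows. This is only that simple baseline, not the authors' statistical model for repeat-rich genomes; the function names, the toy reads, and the threshold value are assumptions chosen for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer observed across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trusted_kmers(counts, threshold):
    """Keep k-mers whose observed frequency exceeds the threshold."""
    return {kmer for kmer, count in counts.items() if count > threshold}


# Hypothetical reads: three identical copies plus one read with a single
# base change (T -> A at the fourth position).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
trusted = trusted_kmers(counts, threshold=1)
print(sorted(trusted))  # ['ACGT', 'CGTA', 'GTAC']
```

The k-mers unique to the altered read (`ACGA`, `CGAA`, `GAAC`) each occur once and fall below the threshold. As the abstract notes, this simple rule can fail in repeat-rich genomes, where an erroneous k-mer that closely resembles a high-copy k-mer may itself be observed frequently.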
Affiliation(s)
- Xiao Yang
- Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50011, USA.
38
Read length and repeat resolution: exploring prokaryote genomes using next-generation sequencing technologies. PLoS One 2010; 5:e11518. [PMID: 20634954 PMCID: PMC2902515 DOI: 10.1371/journal.pone.0011518] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/21/2010] [Accepted: 05/31/2010] [Indexed: 11/19/2022]
Abstract
Background There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats. Methodology/Principal Findings Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads. Conclusions Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.
39
Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010; 11:21. [PMID: 20064276 PMCID: PMC2821320 DOI: 10.1186/1471-2105-11-21] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.4] [Received: 09/03/2009] [Accepted: 01/12/2010] [Indexed: 01/08/2023]
Abstract
Background De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.
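To make the de Bruijn graph framing in the abstract above concrete, here is a toy sketch, not the paper's analysis. It assumes the common construction in which nodes are (k-1)-mers and each k-mer contributes one edge; the function names and the example sequence are hypothetical. Nodes with more than one successor mark places where a repeat makes the path through the graph ambiguous.

```python
from collections import defaultdict


def de_bruijn(sequence, k):
    """Build a de Bruijn graph: each k-mer adds an edge from its
    (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one distinct successor."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy sequence in which "ACG" occurs twice with different
# continuations ("ACGT..." and "ACGA").
graph = de_bruijn("ACGTACGA", k=3)
branching = branching_nodes(graph)
print(branching)  # ['CG']
```

Increasing k (the analogue of a longer read) turns some branching nodes into simple paths, which is the intuition behind studying assembly complexity across a range of read lengths.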
Affiliation(s)
- Carl Kingsford
- Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA.
40
Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T. Estimating mutation distances from unaligned genomes. J Comput Biol 2009; 16:1487-500. [PMID: 19803738 DOI: 10.1089/cmb.2009.0106] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Indexed: 11/13/2022]
Abstract
Alignment-free distance measures are generally less accurate but more efficient than traditional alignment-based metrics. In the context of genome sequence analysis, the efficiency gain is often so substantial that it outweighs the loss in accuracy. However, a further disadvantage of alignment-free distances is that their relationship to evolutionary events such as substitutions is generally unknown. We have therefore derived an estimator of the number of substitutions per site between two unaligned DNA sequences, K(r). Simulations show that this estimator works well with "ideal" data. We compare K(r) to two alternative alignment-free distances: a k-tuple distance and a measure of relative entropy based on average common substring length. All three measures are applied to 27 primate mitochondrial genomes, eight whole genomes of Streptococcus agalactiae strains, and 12 whole genomes of Drosophila species. In each case, the cluster diagrams based on K(r) are equivalent to or significantly better than those based on the two alternative measures. This is due to the fact that in contrast to the alternative measures K(r) is derived from an explicit model of evolution. The computation of K(r) is efficiently implemented in the program kr, which can be downloaded freely from the internet.
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, Plön, Germany.
41
Kerstens HHD, Crooijmans RPMA, Veenendaal A, Dibbits BW, Chin-A-Woeng TFC, den Dunnen JT, Groenen MAM. Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: applied to turkey. BMC Genomics 2009; 10:479. [PMID: 19835600 PMCID: PMC2772860 DOI: 10.1186/1471-2164-10-479] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Received: 05/15/2009] [Accepted: 10/16/2009] [Indexed: 01/18/2023]
Abstract
BACKGROUND The development of second generation sequencing methods has enabled large scale DNA variation studies at moderate cost. For the high throughput discovery of single nucleotide polymorphisms (SNPs) in species lacking a sequenced reference genome, we set-up an analysis pipeline based on a short read de novo sequence assembler and a program designed to identify variation within short reads. To illustrate the potential of this technique, we present the results obtained with a randomly sheared, enzymatically generated, 2-3 kbp genome fraction of six pooled Meleagris gallopavo (turkey) individuals. RESULTS A total of 100 million 36 bp reads were generated, representing approximately 5-6% (approximately 62 Mbp) of the turkey genome, with an estimated sequence depth of 58. Reads consisting of bases called with less than 1% error probability were selected and assembled into contigs. Subsequently, high throughput discovery of nucleotide variation was performed using sequences with more than 90% reliability by using the assembled contigs that were 50 bp or longer as the reference sequence. We identified more than 7,500 SNPs with a high probability of representing true nucleotide variation in turkeys. Increasing the reference genome by adding publicly available turkey BAC-end sequences increased the number of SNPs to over 11,000. A comparison with the sequenced chicken genome indicated that the assembled turkey contigs were distributed uniformly across the turkey genome. Genotyping of a representative sample of 340 SNPs resulted in a SNP conversion rate of 95%. The correlation of the minor allele count (MAC) and observed minor allele frequency (MAF) for the validated SNPs was 0.69. CONCLUSION We provide an efficient and cost-effective approach for the identification of thousands of high quality SNPs in species currently lacking a sequenced genome and applied this to turkey. 
The methodology addresses a random fraction of the genome, resulting in an even distribution of SNPs across the targeted genome.
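As a rough consistency check on the figures quoted in the abstract above, average sequencing depth can be approximated as total sequenced bases divided by the size of the sequenced target. Whether the authors estimated depth exactly this way is an assumption; the function name is hypothetical, and the inputs are the rounded values from the abstract (100 million reads of 36 bp over roughly 62 Mbp).

```python
def sequencing_depth(num_reads, read_length_bp, target_size_bp):
    """Approximate average depth: total sequenced bases / target size."""
    return num_reads * read_length_bp / target_size_bp


# Rounded values taken from the abstract.
depth = sequencing_depth(
    num_reads=100_000_000,
    read_length_bp=36,
    target_size_bp=62_000_000,
)
print(f"estimated depth: {depth:.1f}x")  # estimated depth: 58.1x
```

Under this simple formula the result agrees with the reported estimated depth of 58.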
Affiliation(s)
- Hindrik H D Kerstens
- Animal Breeding and Genomics Center, Wageningen University, Marijkeweg 40, Wageningen, 6709 PG, the Netherlands.
42
Domazet-Loso M, Haubold B. Efficient estimation of pairwise distances between genomes. Bioinformatics 2009; 25:3221-7. [PMID: 19825795 DOI: 10.1093/bioinformatics/btp590] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION Genome comparison is central to contemporary genomics and typically relies on sequence alignment. However, genome-wide alignments are difficult to compute. We have, therefore, recently developed an accurate alignment-free estimator of the number of substitutions per site based on the lengths of exact matches between pairs of sequences. The previous implementation of this measure requires n(n-1) suffix tree constructions and traversals, where n is the number of sequences analyzed. This does not scale well for large n. RESULTS We present an algorithm to extract pairwise distances in a single traversal of a single suffix tree containing n sequences. As a result, the run time of the suffix tree construction phase of our algorithm is reduced from O(n²L) to O(nL), where L is the length of each sequence. We implement this algorithm in the program kr version 2 and apply it to 825 HIV genomes, 13 genomes of enterobacteria and the complete genomes of 12 Drosophila species. We show that, depending on the input dataset, the new program is at least 10 times faster than its predecessor. AVAILABILITY Version 2 of kr can be tested via a web interface at http://guanine.evolbio.mpg.de/kr2/. It is written in standard C and its source code is available under the GNU General Public License from the same web site. CONTACT haubold@evolbio.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
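The idea of using exact match lengths as an alignment-free signal can be illustrated with a minimal sketch. This is not the kr algorithm: it uses naive substring search instead of a suffix tree, and it stops at the raw match lengths rather than converting them into substitutions per site, since the abstract does not give that estimator. The function names `match_lengths` and `mean_match_length` are hypothetical.

```python
def match_lengths(query: str, subject: str) -> list[int]:
    """For each start position in query, return the length of the longest
    exact match that begins there and occurs anywhere in subject.

    Naive quadratic-or-worse search, for illustration only; a suffix tree
    (as in the abstract) is what makes this practical for genomes.
    """
    lengths = []
    for start in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs in subject.
        while (start + length < len(query)
               and query[start:start + length + 1] in subject):
            length += 1
        lengths.append(length)
    return lengths


def mean_match_length(query: str, subject: str) -> float:
    """Average exact match length; shorter averages suggest more divergence."""
    lengths = match_lengths(query, subject)
    if not lengths:
        raise ValueError("query must not be empty")
    return sum(lengths) / len(lengths)
```

Under these assumptions, two closely related sequences share long exact matches, while diverged sequences yield short ones, which is the intuition the estimator builds on.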
Affiliation(s)
- Mirjana Domazet-Loso
- Department of Evolutionary Genetics, Max-Planck-Institute for Evolutionary Biology, 24306 Plön, Germany
|
43
|
Treangen TJ, Abraham AL, Touchon M, Rocha EPC. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS Microbiol Rev 2009; 33:539-71. [PMID: 19396957 DOI: 10.1111/j.1574-6976.2009.00169.x] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.4] [Indexed: 02/06/2023] Open
Abstract
DNA repeats are causes and consequences of genome plasticity. Repeats are created by intrachromosomal recombination or horizontal transfer. They are targeted by recombination processes leading to amplifications, deletions and rearrangements of genetic material. The identification and analysis of repeats in nearly 700 genomes of bacteria and archaea is facilitated by the existence of sequence data and adequate bioinformatic tools. These have revealed the immense diversity of repeats in genomes, from those created by selfish elements to the ones used for protection against selfish elements, from those arising from transient gene amplifications to the ones leading to stable duplications. Experimental works have shown that some repeats do not carry any adaptive value, while others allow functional diversification and increased expression. All repeats carry some potential to disorganize and destabilize genomes. Because recombination and selection for repeats vary between genomes, the number and types of repeats are also quite diverse and in line with ecological variables, such as host-dependent associations or population sizes, and with genetic variables, such as the recombination machinery. From an evolutionary point of view, repeats represent both opportunities and problems. We describe how repeats are created and how they can be found in genomes. We then focus on the functional and genomic consequences of repeats that dictate their fate.
|
44
|
Paar V, Pavin N, Basar I, Rosandić M, Gluncić M, Paar N. Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats. BMC Bioinformatics 2008; 9:466. [PMID: 18980673 PMCID: PMC2661002 DOI: 10.1186/1471-2105-9-466] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 03/12/2008] [Accepted: 11/03/2008] [Indexed: 11/28/2022] Open
Abstract
Background Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats. Results We used a method based on the DFT, with mapping of the symbolic sequence into a numerical one, to identify and study alphoid higher order repeats (HORs). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of a HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, with a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations. Conclusion The DFT provides a robust detection method for higher order periodicity. An easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
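The equidistant-peak behaviour described in this abstract can be illustrated with a minimal sketch of a DNA power spectrum. The abstract only states that the symbolic sequence is mapped into a numerical one; the mapping assumed here (one binary indicator sequence per base, with the four spectra summed) is a common choice and not necessarily the one the authors used. The DFT is computed naively in O(n²) with the standard library, and the names `power_spectrum` and `dominant_period` are hypothetical.

```python
import cmath


def power_spectrum(seq: str, alphabet: str = "ACGT") -> list[float]:
    """Sum of squared DFT magnitudes over the base-indicator sequences.

    Assumption: each base is mapped to a 0/1 indicator sequence; this is
    an illustrative mapping, not taken from the abstract.
    """
    n = len(seq)
    spectrum = [0.0] * n
    for base in alphabet:
        indicator = [1.0 if ch == base else 0.0 for ch in seq]
        for k in range(n):
            # Naive DFT coefficient at frequency index k.
            coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t, x in enumerate(indicator))
            spectrum[k] += abs(coeff) ** 2
    return spectrum


def dominant_period(seq: str) -> float:
    """Period (in bases) of the strongest peak, ignoring frequency 0."""
    n = len(seq)
    spec = power_spectrum(seq)
    # The spectrum of a real signal is symmetric, so the first half suffices.
    k_best = max(range(1, n // 2 + 1), key=lambda k: spec[k])
    return n / k_best
```

For a perfectly periodic toy sequence such as `"ACGTT" * 8`, power is concentrated at frequency indices that are multiples of 8 (the fundamental and its harmonics) and is essentially zero elsewhere, a small-scale analogue of the equidistant peaks the abstract describes.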
Affiliation(s)
- Vladimir Paar
- Faculty of Science, University of Zagreb, Bijenicka 32, Zagreb, Croatia.
|
45
|
Vinga S, Almeida JS. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 2007; 8:393. [PMID: 17939871 PMCID: PMC2238722 DOI: 10.1186/1471-2105-8-393] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 05/10/2007] [Accepted: 10/16/2007] [Indexed: 11/18/2022] Open
Abstract
Background In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition explored therein was based on the Rényi entropy of probability density estimation (pdf) using Parzen's window method, applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles using the new pdf estimations to refine the density estimation of motifs. Results The new methodology enables two results. On the one hand, it shows that the entropic profiles are directly related to the statistical significance of motifs, allowing the study of under- and over-representation of segments. On the other hand, by spanning the parameters of the kernel function it is possible to extract important information about the scale of each conserved DNA region. The computational applications, developed in Matlab m-code, the corresponding binary executables and additional material and examples are made publicly available at . Conclusion The ability to detect local conservation from a scale-independent representation of symbolic sequences is particularly relevant for biological applications where conserved motifs occur in multiple, overlapping scales, with significant future applications in the recognition of foreign genomic material and inference of motif structures.
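The building blocks named in this abstract, the Chaos Game Representation and a Parzen-window Rényi entropy, can be sketched as follows. This is an illustration under stated assumptions, not the authors' Matlab code: the corner assignment for A, C, G and T is one common convention, the entropy is the order-2 (quadratic) case with a Gaussian kernel, which has a closed form over point pairs, and the kernel width `sigma` is an arbitrary illustrative value. The fractal kernel and the local entropic profiles of the paper are not reproduced. The names `cgr_points` and `renyi2_entropy` are hypothetical.

```python
import math

# Assumed corner convention for the unit square (illustrative choice).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> list[tuple[float, float]]:
    """Chaos Game Representation: start at the centre of the unit square
    and move halfway towards the corner of each successive base."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def renyi2_entropy(points: list[tuple[float, float]], sigma: float = 0.05) -> float:
    """Order-2 Renyi entropy of a Gaussian Parzen-window density estimate.

    For Gaussian kernels of width sigma, the integral of the squared
    density reduces to a mean over all point pairs of a Gaussian with
    variance 2 * sigma**2 per dimension (2-D normalisation used here).
    """
    n = len(points)
    norm = 1.0 / (4.0 * math.pi * sigma ** 2)
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
            total += norm * math.exp(-d2 / (4.0 * sigma ** 2))
    return -math.log(total / n ** 2)
```

With these choices, a homopolymer such as `"A" * 50` collapses towards a single corner and yields a lower entropy than `"ACGT" * 12`, whose points settle into four separate clusters.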
Affiliation(s)
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal.
|