1
Ben Shabat D, Hadad A, Boruchovsky A, Yaakobi E. GradHC: highly reliable gradual hash-based clustering for DNA storage systems. Bioinformatics (Oxford, England) 2024; 40:btae274. [PMID: 38648049 DOI: 10.1093/bioinformatics/btae274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 03/27/2024] [Accepted: 04/17/2024] [Indexed: 04/25/2024]
Abstract
MOTIVATION: As data storage challenges grow and existing technologies approach their limits, synthetic DNA emerges as a promising storage solution due to its remarkable density and durability advantages. While cost remains a concern, emerging sequencing and synthetic technologies aim to mitigate it, yet introduce challenges such as errors in the storage and retrieval process. One crucial task in a DNA storage system is clustering numerous DNA reads into groups that represent the original input strands. RESULTS: In this paper, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, including varying strand lengths, cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results. AVAILABILITY AND IMPLEMENTATION: https://github.com/bensdvir/GradHC.
Affiliation(s)
- Dvir Ben Shabat
- Department of Computer Science, Technion, Haifa 320003, Israel
- Adar Hadad
- Department of Computer Science, Technion, Haifa 320003, Israel
- Eitan Yaakobi
- Department of Computer Science, Technion, Haifa 320003, Israel
2
Cao B, Zheng Y, Shao Q, Liu Z, Xie L, Zhao Y, Wang B, Zhang Q, Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep 2024; 43:113699. [PMID: 38517891 DOI: 10.1016/j.celrep.2024.113699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/15/2023] [Accepted: 01/05/2024] [Indexed: 03/24/2024] Open
Abstract
Over the past decade, the rapid development of DNA synthesis and sequencing technologies has enabled preliminary use of DNA molecules for digital data storage, overcoming the capacity and persistence bottlenecks of silicon-based storage media. DNA storage has now been fully accomplished in the laboratory through existing biotechnology, which again demonstrates the viability of carbon-based storage media. However, the high cost and latency of data reconstruction pose challenges that hinder the practical implementation of DNA storage beyond the laboratory. In this article, we review existing advanced DNA storage methods, analyze the characteristics and performance of biotechnological approaches at various stages of data writing and reading, and discuss potential factors influencing DNA storage from the perspective of data reconstruction.
Affiliation(s)
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China; Centre for Frontier AI Research, Agency for Science, Technology, and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632, Singapore
- Yanfen Zheng
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Qi Shao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Zhenlu Liu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Lei Xie
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Yunzhu Zhao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China.
- Xiaopeng Wei
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
3
Xu C, Li J, Song LY, Guo ZJ, Song SW, Zhang LD, Zheng HL. PlantC2U: deep learning of cross-species sequence landscapes predicts plastid C-to-U RNA editing in plants. Journal of Experimental Botany 2024; 75:2266-2279. [PMID: 38190348 DOI: 10.1093/jxb/erae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Accepted: 01/07/2024] [Indexed: 01/10/2024]
Abstract
In plants, C-to-U RNA editing mainly occurs in plastid and mitochondrial transcripts, which contributes to a complex transcriptional regulatory network. More evidence reveals that RNA editing plays critical roles in plant growth and development. However, accurate detection of RNA editing sites using transcriptome sequencing data alone is still challenging. In the present study, we develop PlantC2U, which is a convolutional neural network, to predict plastid C-to-U RNA editing based on the genomic sequence. PlantC2U achieves >95% sensitivity and 99% specificity, which outperforms the PREPACT tool, random forests, and support vector machines. PlantC2U not only further checks RNA editing sites from transcriptome data to reduce possible false positives, but also assesses the effect of different mutations on C-to-U RNA editing based on the flanking sequences. Moreover, we found the patterns of tissue-specific RNA editing in the mangrove plant Kandelia obovata, and observed reduced C-to-U RNA editing rates in the cold stress response of K. obovata, suggesting their potential regulatory roles in plant stress adaptation. In addition, we present RNAeditDB, available online at https://jasonxu.shinyapps.io/RNAeditDB/. Together, PlantC2U and RNAeditDB will help researchers explore the RNA editing events in plants and thus will be of broad utility for the plant research community.
Affiliation(s)
- Chaoqun Xu
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Jing Li
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Ling-Yu Song
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Ze-Jun Guo
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Shi-Wei Song
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Lu-Dan Zhang
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
- Hai-Lei Zheng
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
4
Wright E. Accurately clustering biological sequences in linear time by relatedness sorting. Nat Commun 2024; 15:3047. [PMID: 38589369 PMCID: PMC11001989 DOI: 10.1038/s41467-024-47371-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open
Abstract
Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, I set out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem. Clusterize produces clusters with accuracy rivaling popular programs (CD-HIT, MMseqs2, and UCLUST) but exhibits linear asymptotic scalability. Clusterize generates higher accuracy and oftentimes much larger clusters than Linclust, a fast linear time clustering algorithm. I demonstrate the utility of Clusterize by accurately solving different clustering problems involving millions of nucleotide or protein sequences.
Affiliation(s)
- Erik Wright
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA.
5
Alipour F, Holmes C, Lu YY, Hill KA, Kari L. Leveraging machine learning for taxonomic classification of emerging astroviruses. Front Mol Biosci 2024; 10:1305506. [PMID: 38274100 PMCID: PMC10808839 DOI: 10.3389/fmolb.2023.1305506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method's pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach augmented by a principal component analysis (PCA) provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
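As a rough illustration of what "k-mer composition" means as a feature for the classifiers mentioned above, the following sketch turns a nucleotide sequence into a normalized k-mer frequency vector. It is not the authors' pipeline: the function name `kmer_composition`, the default `k=3`, and the choice to skip k-mers containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=3):
    """Return the normalized frequency of every possible DNA k-mer in seq.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    k-mers containing other symbols (for example N) are ignored
    (an assumption of this sketch, not a rule taken from the paper).
    """
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[km] / total for km in all_kmers]


if __name__ == "__main__":
    vector = kmer_composition("ACGTACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Vectors of this kind can be fed to any standard supervised or unsupervised learner; which learners and which value of k the study used is not stated in the abstract.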
Affiliation(s)
- Fatemeh Alipour
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Connor Holmes
- Department of Biology, University of Western Ontario, London, ON, Canada
- Yang Young Lu
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
6
Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, Li G. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing. Genome Biol 2023; 24:222. [PMID: 37798751 PMCID: PMC10552309 DOI: 10.1186/s13059-023-03053-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2022] [Accepted: 09/08/2023] [Indexed: 10/07/2023] Open
Abstract
DNA barcodes enable Oxford Nanopore sequencing to sequence multiple barcoded DNA samples on a single flow cell. DNA sequences with the same barcode need to be grouped together through demultiplexing. As the number of samples increases, accurate demultiplexing becomes difficult. We introduce HycDemux, which incorporates a GPU-parallelized hybrid clustering algorithm that uses nanopore signals and DNA sequences for accurate data clustering, alongside a voting-based module to finalize the demultiplexing results. Comprehensive experiments demonstrate that our approach outperforms unsupervised tools in short sequence fragment clustering and performs more robustly than current state-of-the-art demultiplexing tools for complex multi-sample sequencing data.
Affiliation(s)
- Renmin Han
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Junhai Qi
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- BioMap Research, California, USA
- Yang Xue
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
- Xiujuan Sun
- High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Fa Zhang
- School of Medical Technology, Beijing Institute of Technology, Beijing, 100085, China.
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955, Saudi Arabia.
- Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China.
7
Millan Arias P, Hill KA, Kari L. iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences. Bioinformatics 2023; 39:btad508. [PMID: 37589603 PMCID: PMC10483029 DOI: 10.1093/bioinformatics/btad508] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/13/2022] [Revised: 07/18/2023] [Accepted: 08/16/2023] [Indexed: 08/18/2023] Open
Abstract
SUMMARY: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of ∼20%, and the two specialized algorithms by an average of ∼12%, on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabeled DNA sequences. AVAILABILITY AND IMPLEMENTATION: iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.
Affiliation(s)
- Pablo Millan Arias
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON N6A 5B7, Canada
- Lila Kari
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
8
Wei ZG, Chen X, Zhang XD, Zhang H, Fan XG, Gao HY, Liu F, Qian Y. Comparison of Methods for Biological Sequence Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2874-2888. [PMID: 37028305 DOI: 10.1109/tcbb.2023.3253138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is a powerful way to study and probe large-scale sequence data. A number of clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared, and the evaluation metrics rely heavily on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseqs2, Linclust, edClust) are assessed; ii) two alignment-free methods (LZW-Kernel and Mash) are included to compare with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analysts choose a reasonable clustering algorithm for processing their collected sequences and, furthermore, to motivate algorithm designers to develop more efficient sequence clustering approaches.
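To make the notion of a label-based ("supervised") evaluation measure concrete, here is a minimal sketch of cluster purity, one common metric of that kind. The abstract does not name the specific metrics the benchmark uses, so purity is chosen here only as an assumed example, and the function name `purity` is hypothetical.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Fraction of items that carry the majority true label of their cluster.

    true_labels and cluster_ids are parallel sequences: item i has ground
    truth true_labels[i] and was assigned to cluster cluster_ids[i].
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if not true_labels:
        raise ValueError("at least one item is required")
    # Collect the true labels that ended up in each predicted cluster.
    members = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid].append(label)
    # For each cluster, count how many items share its most common label.
    majority = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority / len(true_labels)


if __name__ == "__main__":
    print(purity(["a", "a", "b", "b"], [0, 0, 0, 1]))
```

Purity rewards over-splitting (every singleton cluster is perfectly pure), which is one reason benchmarks typically report several complementary measures rather than a single score.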
9
Wang P, Cao B, Ma T, Wang B, Zhang Q, Zheng P. DUHI: Dynamically updated hash index clustering method for DNA storage. Comput Biol Med 2023; 164:107244. [PMID: 37453377 DOI: 10.1016/j.compbiomed.2023.107244] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/28/2023] [Revised: 06/08/2023] [Accepted: 07/07/2023] [Indexed: 07/18/2023]
Abstract
The exponential growth of global data leads to the problem of insufficient data storage capacity. DNA storage can be an ideal storage method due to its high storage density and long storage time. However, the DNA storage process is subject to unavoidable errors that can lead to increased cluster redundancy during data reading, which in turn affects the accuracy of the data reads. This paper proposes a dynamically updated hash index (DUHI) clustering method for DNA storage, which clusters sequences by constructing a dynamic core index set and using hash lookup. The proposed clustering method is analyzed in terms of overall reliability evaluation and visualization evaluation. The results show that the DUHI clustering method can reduce the redundancy of more than 10% of the sequences within the cluster and increase the reconstruction rate of the sequences to more than 99%. Therefore, our method solves the high redundancy problem after DNA sequence clustering, improves the accuracy of data reading, and promotes the development of DNA storage.
Affiliation(s)
- Penghao Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Tao Ma
- Brain Function Research Section, The First Hospital of China Medical University, 110001, Shenyang, China
- Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China.
- Qiang Zhang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China
- Pan Zheng
- Department of Accounting and Information Systems, University of Canterbury, 8140, Christchurch, New Zealand
10
Johnson MS, Venkataram S, Kryazhimskiy S. Best Practices in Designing, Sequencing, and Identifying Random DNA Barcodes. J Mol Evol 2023; 91:263-280. [PMID: 36651964 PMCID: PMC10276077 DOI: 10.1007/s00239-022-10083-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 12/15/2022] [Indexed: 01/19/2023]
Abstract
Random DNA barcodes are a versatile tool for tracking cell lineages, with applications ranging from development to cancer to evolution. Here, we review and critically evaluate barcode designs as well as methods of barcode sequencing and initial processing of barcode data. We first demonstrate how various barcode design decisions affect data quality and propose a new design that balances all considerations that we are currently aware of. We then discuss various options for the preparation of barcode sequencing libraries, including inline indices and Unique Molecular Identifiers (UMIs). Finally, we test the performance of several established and new bioinformatic pipelines for the extraction of barcodes from raw sequencing reads and for error correction. We find that both alignment and regular expression-based approaches work well for barcode extraction, and that error-correction pipelines designed specifically for barcode data are superior to generic ones. Overall, this review will help researchers to approach their barcoding experiments in a deliberate and systematic way.
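The regular expression-based extraction mentioned above can be sketched as follows. This is only an illustration of the general idea, not any of the pipelines tested in the review: the flanking sequences `LEFT_FLANK` and `RIGHT_FLANK`, the 12-base barcode length, and the function name `extract_barcode` are hypothetical values chosen for the example.

```python
import re

# Hypothetical constant regions on either side of the barcode; real
# barcode designs use their own flanks and lengths.
LEFT_FLANK = "ACGTAC"
RIGHT_FLANK = "TTGCAA"
BARCODE_LENGTH = 12

# Capture exactly BARCODE_LENGTH unambiguous bases between the two flanks.
BARCODE_PATTERN = re.compile(
    f"{LEFT_FLANK}([ACGT]{{{BARCODE_LENGTH}}}){RIGHT_FLANK}"
)


def extract_barcode(read):
    """Return the barcode found in read, or None if the pattern is absent."""
    match = BARCODE_PATTERN.search(read)
    return match.group(1) if match else None


if __name__ == "__main__":
    read = "GG" + LEFT_FLANK + "AACCGGTTAACC" + RIGHT_FLANK + "CC"
    print(extract_barcode(read))
```

An exact-match pattern like this one discards reads with sequencing errors in the flanks or with barcodes of the wrong length; tolerating such errors would need a fuzzier pattern or an alignment step, followed by error correction of the extracted barcodes.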
Affiliation(s)
- Milo S Johnson
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, 94720, USA
- Sandeep Venkataram
- Department of Ecology, Behavior and Evolution, University of California San Diego, La Jolla, CA, 92093, USA
- Sergey Kryazhimskiy
- Department of Ecology, Behavior and Evolution, University of California San Diego, La Jolla, CA, 92093, USA.
11
Xu X, Yin Z, Yan L, Zhang H, Xu B, Wei Y, Niu B, Schmidt B, Liu W. RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biol 2023; 24:121. [PMID: 37198663 PMCID: PMC10190105 DOI: 10.1186/s13059-023-02961-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 05/05/2023] [Indexed: 05/19/2023] Open
Abstract
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
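For readers unfamiliar with sketch-based distance estimation, the following toy example shows the basic MinHash idea: two sequences are reduced to short lists of minimum hash values over their k-mers, and the fraction of matching positions estimates the Jaccard similarity of the k-mer sets. It is a generic textbook-style sketch, not RabbitTClust's implementation; the function names, `k=8`, `num_hashes=64`, and the use of BLAKE2b as the hash are assumptions for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted with a seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=64):
    """One minimum hash value per seeded hash function.

    Assumes len(seq) >= k, so that at least one k-mer exists.
    """
    kmer_set = kmers(seq, k)
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions whose minimum hashes agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = minhash_sketch("ACGTACGTTGCAGGCTAAGCTTACG")
    b = minhash_sketch("ACGTACGTTGCAGGCTAAGCTTACC")
    print(estimate_jaccard(a, b))
```

Because only the fixed-size sketches are compared, the cost of a pairwise comparison no longer depends on genome length, which is what makes this family of methods attractive for very large collections.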
Affiliation(s)
- Xiaoming Xu
- School of Software, Shandong University, Jinan, China
- Zekun Yin
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
- Lifeng Yan
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
- Hao Zhang
- School of Software, Shandong University, Jinan, China
- Shenzhen Research Institute of Shandong University, Shandong University, Shenzhen, China
- Borui Xu
- School of Software, Shandong University, Jinan, China
- Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Beifang Niu
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Bertil Schmidt
- Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Weiguo Liu
- School of Software, Shandong University, Jinan, China
12
Luan T, Muralidharan HS, Alshehri M, Mittra I, Pop M. SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. Nucleic Acids Res 2023; 51:e46. [PMID: 36912074 PMCID: PMC10164572 DOI: 10.1093/nar/gkad158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Revised: 02/01/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023] Open
Abstract
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
Affiliation(s)
- Tu Luan
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Harihara Subrahmaniam Muralidharan
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Marwan Alshehri
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Ipsa Mittra
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Mihai Pop
- Department of Computer Science, University of Maryland, College Park, 20742 MD, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
13
Rubio A, Sprang M, Garzón A, Moreno-Rodriguez A, Pachón-Ibáñez ME, Pachón J, Andrade-Navarro MA, Pérez-Pulido AJ. Analysis of bacterial pangenomes reduces CRISPR dark matter and reveals strong association between membranome and CRISPR-Cas systems. Science Advances 2023; 9:eadd8911. [PMID: 36961900 PMCID: PMC10038342 DOI: 10.1126/sciadv.add8911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 02/17/2023] [Indexed: 06/18/2023]
Abstract
CRISPR-Cas systems are prokaryotic acquired immunity mechanisms, which are found in 40% of bacterial genomes. They prevent viral infections through small DNA fragments called spacers. However, the vast majority of these spacers have not yet been associated with the virus they recognize, and this unassigned fraction has been named CRISPR dark matter. By analyzing the spacers of tens of thousands of genomes from six bacterial species, we have been able to reduce the CRISPR dark matter from 80% to as low as 15% in some of the species. In addition, we have observed that, when a genome presents CRISPR-Cas systems, this is accompanied by particular sets of membrane proteins. Our results suggest that when bacteria present membrane proteins that make them compete better in their environment and these proteins are, in turn, receptors for specific phages, they would be forced to acquire CRISPR-Cas.
Affiliation(s)
- Alejandro Rubio
- Andalusian Centre for Developmental Biology (CABD, UPO-CSIC-JA), Faculty of Experimental Sciences (Genetics Department), University Pablo de Olavide, 41013 Seville, Spain
- Maximilian Sprang
- Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Andrés Garzón
- Andalusian Centre for Developmental Biology (CABD, UPO-CSIC-JA), Faculty of Experimental Sciences (Genetics Department), University Pablo de Olavide, 41013 Seville, Spain
- Antonio Moreno-Rodriguez
- Andalusian Centre for Developmental Biology (CABD, UPO-CSIC-JA), Faculty of Experimental Sciences (Genetics Department), University Pablo de Olavide, 41013 Seville, Spain
- Maria Eugenia Pachón-Ibáñez
- Institute of Biomedicine of Seville (IBiS), Virgen del Rocío Hospital/CSIC/University of Seville, Seville, Spain
- CIBER de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain
- Jerónimo Pachón
- Institute of Biomedicine of Seville (IBiS), Virgen del Rocío Hospital/CSIC/University of Seville, Seville, Spain
- Department of Medicine, School of Medicine, University of Seville, Seville, Spain
| | - Miguel A. Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
| | - Antonio J. Pérez-Pulido
- Andalusian Centre for Developmental Biology (CABD, UPO-CSIC-JA), Faculty of Experimental Sciences (Genetics Department), University Pablo de Olavide, 41013 Seville, Spain
| |
Collapse
|
14
|
Neupane A, Chariker JH, Rouchka EC. Structural and Functional Classification of G-Quadruplex Families within the Human Genome. Genes (Basel) 2023; 14:genes14030645. [PMID: 36980918 PMCID: PMC10048163 DOI: 10.3390/genes14030645] [Received: 02/13/2023] [Revised: 02/22/2023] [Accepted: 03/02/2023]
Abstract
G-quadruplexes (G4s) are short secondary DNA structures located throughout genomic DNA and transcribed RNA. Although G4 structures have been shown to form in vivo, no current search tools that examine these structures based on previously identified G-quadruplexes and filter them based on similar sequence, structure, and thermodynamic properties are known to exist. We present a framework for clustering G-quadruplex sequences into families using the CD-HIT, MeShClust, and DNACLUST methods along with a combination of Starcode and BLAST. Utilizing this framework to filter and annotate clusters, 95 families of G-quadruplex sequences were identified within the human genome. Profiles for each family were created using hidden Markov models to allow for the identification of additional family members and generate homology probability scores. The thermodynamic folding energy properties, functional annotation of genes associated with the sequences, scores from different prediction algorithms, and transcription factor binding motifs within a family were used to annotate and compare the diversity within and across clusters. The resulting set of G-quadruplex families can be used to further understand how different regions of the genome are regulated by factors targeting specific structures common to members of a specific cluster.
Affiliation(s)
- Aryan Neupane
  - School of Graduate and Interdisciplinary Studies, University of Louisville, Louisville, KY 40292, USA
- Julia H. Chariker
  - Department of Neuroscience Training, University of Louisville, Louisville, KY 40292, USA
  - Kentucky IDeA Network of Biomedical Research Excellence (KY INBRE) Bioinformatics Core, University of Louisville, Louisville, KY 40292, USA
- Eric C. Rouchka
  - Kentucky IDeA Network of Biomedical Research Excellence (KY INBRE) Bioinformatics Core, University of Louisville, Louisville, KY 40292, USA
  - Department of Biochemistry and Molecular Genetics, University of Louisville, Louisville, KY 40292, USA
  - Correspondence: Tel.: +1-(502)-852-3060
|
15
|
Federated learning review: Fundamentals, enabling technologies, and future applications. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.103061]
|
16
|
Molo MS, White JB, Cornish V, Gell RM, Baars O, Singh R, Carbone MA, Isakeit T, Wise KA, Woloshuk CP, Bluhm BH, Horn BW, Heiniger RW, Carbone I. Asymmetrical lineage introgression and recombination in populations of Aspergillus flavus: Implications for biological control. PLoS One 2022; 17:e0276556. [PMID: 36301851 PMCID: PMC9620740 DOI: 10.1371/journal.pone.0276556] [Received: 04/21/2022] [Accepted: 10/08/2022]
Abstract
Aspergillus flavus is an agriculturally important fungus that causes ear rot of maize and produces aflatoxins, of which B1 is the most carcinogenic naturally-produced compound. In the US, the management of aflatoxins includes the deployment of biological control agents that comprise two nonaflatoxigenic A. flavus strains, either Afla-Guard (member of lineage IB) or AF36 (lineage IC). We used genotyping-by-sequencing to examine the influence of both biocontrol agents on native populations of A. flavus in cornfields in Texas, North Carolina, Arkansas, and Indiana. This study examined up to 27,529 single-nucleotide polymorphisms (SNPs) in a total of 815 A. flavus isolates, and 353 genome-wide haplotypes sampled before biocontrol application, three months after biocontrol application, and up to three years after initial application. Here, we report that the two distinct A. flavus evolutionary lineages IB and IC differ significantly in their frequency distributions across states. We provide evidence of increased unidirectional gene flow from lineage IB into IC, inferred to be due to the applied Afla-Guard biocontrol strain. Genetic exchange and recombination of biocontrol strains with native strains was detected in as little as three months after biocontrol application and up to one and three years later. There was limited inter-lineage migration in the untreated fields. These findings suggest that biocontrol products that include strains from lineage IB offer the greatest potential for sustained reductions in aflatoxin levels over several years. This knowledge has important implications for developing new biocontrol strategies.
Affiliation(s)
- Megan S. Molo
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
- James B. White
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
- Vicki Cornish
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
- Richard M. Gell
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
  - Program of Genetics, North Carolina State University, Raleigh, North Carolina, United States of America
- Oliver Baars
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
- Rakhi Singh
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
- Mary Anna Carbone
  - Center for Integrated Fungal Research and Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, United States of America
- Thomas Isakeit
  - Department of Plant Pathology and Microbiology, Texas AgriLife Extension Service, Texas A&M University, College Station, TX, United States of America
- Kiersten A. Wise
  - Department of Plant Pathology, University of Kentucky, Princeton, KY, United States of America
- Charles P. Woloshuk
  - Department of Plant Pathology and Botany, Purdue University, West Lafayette, IN, United States of America
- Burton H. Bluhm
  - University of Arkansas Division of Agriculture, Department of Entomology and Plant Pathology, Fayetteville, AR, United States of America
- Bruce W. Horn
  - United States Department of Agriculture, Agriculture Research Service, Dawson, GA, United States of America
- Ron W. Heiniger
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC, United States of America
- Ignazio Carbone
  - Department of Entomology and Plant Pathology, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC, United States of America
  - Program of Genetics, North Carolina State University, Raleigh, North Carolina, United States of America
|
17
|
Qu G, Yan Z, Wu H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief Bioinform 2022; 23:6668252. [PMID: 35975958 DOI: 10.1093/bib/bbac336] [Received: 05/18/2022] [Revised: 07/21/2022] [Accepted: 07/22/2022]
Abstract
Deoxyribonucleic acid (DNA)-based data storage is a promising new storage technology which has the advantage of high storage capacity and long storage time compared with traditional storage media. However, the synthesis and sequencing process of DNA can randomly generate many types of errors, which makes it more difficult to cluster DNA sequences to recover DNA information. Currently, the available DNA clustering algorithms are targeted at DNA sequences in the biological domain, which not only cannot adapt to the characteristics of sequences in DNA storage, but also tend to be unacceptably time-consuming for billions of DNA sequences in DNA storage. In this paper, we propose an efficient DNA clustering method termed Clover for DNA storage with linear computational complexity and low memory. Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. We argue through theoretical proofs that Clover has standard linear computational complexity, low space complexity, etc. Experiments show that our method can cluster 10 million DNA sequences into 50 000 classes in 10 s and meet an accuracy rate of over 99%. Furthermore, we have successfully completed an unprecedented clustering of 10 billion DNA data on a single home computer and the time consumption still satisfies the linear relationship. Clover is freely available at https://github.com/Guanjinqu/Clover.
Affiliation(s)
- Guanjin Qu
  - Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
- Zihui Yan
  - Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
- Huaming Wu
  - Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
|
18
|
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022]
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
Affiliation(s)
- Martin T Swain
  - Department of Life Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
- Martin Vickers
  - The John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
|
19
|
Girgis HZ. MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. BMC Genomics 2022; 23:423. [PMID: 35668366 PMCID: PMC9171953 DOI: 10.1186/s12864-022-08619-0] [Received: 01/15/2022] [Accepted: 05/11/2022]
Abstract
Background Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is a big room for improvement in terms of cluster quality. Motivated by this opportunity for improving cluster quality, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning. Its strong theoretical foundation guarantees the convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adapting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0. Results We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0, and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected for testing the tools in terms of scalability and different similarity levels among sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%–300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate an important parameter that controls cluster membership with high accuracy. 
Conclusions These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help with deriving accurate biological knowledge. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-022-08619-0).
Affiliation(s)
- Hani Z Girgis
  - Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA
|
20
|
Aunin E, Berriman M, Reid AJ. Characterising genome architectures using genome decomposition analysis. BMC Genomics 2022; 23:398. [PMID: 35610562 PMCID: PMC9131526 DOI: 10.1186/s12864-022-08616-3] [Received: 12/15/2021] [Accepted: 05/10/2022]
Abstract
Genome architecture describes how genes and other features are arranged in genomes. These arrangements reflect the evolutionary pressures on genomes and underlie biological processes such as chromosomal segregation and the regulation of gene expression. We present a new tool called Genome Decomposition Analysis (GDA) that characterises genome architectures and acts as an accessible approach for discovering hidden features of a genome assembly. With the imminent deluge of high-quality genome assemblies from projects such as the Darwin Tree of Life and the Earth BioGenome Project, GDA has been designed to facilitate their exploration and the discovery of novel genome biology. We highlight the effectiveness of our approach in characterising the genome architectures of single-celled eukaryotic parasites from the phylum Apicomplexa and show that it scales well to large genomes.
Affiliation(s)
- Eerik Aunin
  - Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
- Matthew Berriman
  - Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
  - Wellcome Centre for Integrative Parasitology, University of Glasgow, G12 8TA, Glasgow, UK
- Adam James Reid
  - Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
  - Wellcome/Cancer Research UK Gurdon Institute, University of Cambridge, CB2 1QN, Cambridge, UK
|
21
|
Kioukis A, Pourjam M, Neuhaus K, Lagkouvardos I. Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling. FRONTIERS IN BIOINFORMATICS 2022; 2:864597. [PMID: 36304326 PMCID: PMC9580952 DOI: 10.3389/fbinf.2022.864597] [Received: 01/28/2022] [Accepted: 03/31/2022]
Abstract
Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of generated sequencing data, greedy algorithms are preferred for their time efficiency. Such algorithms rely only on pairwise sequence similarities. Thus, sequences with diverse phylogenetic backgrounds are sometimes clustered together. In contrast, taxonomic classifiers use position-specific taxonomic information in assigning a probable taxonomy to a given sequence. Here we introduce Taxonomy Informed Clustering (TIC), a novel approach that utilizes classifier-assigned taxonomy to restrict clustering to only those sequences that share the same taxonomic path. Based on this concept, we offer a complete and automated pipeline for processing of 16S rRNA amplicon datasets in diversity analyses. First, raw reads are processed to form denoised amplicons. Next, the denoised amplicons are taxonomically classified. Finally, the TIC algorithm progressively assigns clusters at molecular species, genus and family levels. TIC outperforms greedy clustering algorithms like USEARCH and VSEARCH in terms of clusters’ purity and entropy, when using data from the Living Tree Project as test samples. Furthermore, we applied TIC on a dataset containing all Bifidobacteriaceae-classified sequences from the IMNGS database. Here, TIC identified evidence for 1000s of novel molecular genera and species. These results highlight the straightforward application of the TIC pipeline and superior results compared to former methods in diversity studies. The pipeline is freely available at: https://github.com/Lagkouvardos/TIC.
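The core restriction described in this abstract, that a sequence may only join a cluster whose members share its taxonomic path, can be illustrated with a small sketch. This is not the TIC implementation: the `identity` function is a toy position-wise match score, and the names `taxonomy_restricted_clusters`, `records`, and `threshold` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def identity(a, b):
    """Toy identity: fraction of matching positions (stand-in for a real alignment score)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def taxonomy_restricted_clusters(records, threshold):
    """Greedy clustering where sequences are only compared within the same taxonomic path.

    records: iterable of (sequence, taxonomic_path) pairs, path given as a tuple.
    Returns a list of (taxonomic_path, member_sequences) pairs.
    """
    by_path = defaultdict(list)
    for seq, path in records:
        by_path[path].append(seq)

    clusters = []
    for path, seqs in by_path.items():
        seeds = []  # (seed_sequence, members) for this taxonomic path only
        for seq in seqs:
            for seed, members in seeds:
                if identity(seed, seq) >= threshold:
                    members.append(seq)
                    break
            else:
                seeds.append((seq, [seq]))
        clusters.extend((path, members) for _, members in seeds)
    return clusters


records = [
    ("ACGTACGTAC", ("Bacteria", "Firmicutes")),
    ("ACGTACGTAA", ("Bacteria", "Firmicutes")),
    ("ACGTACGTAC", ("Bacteria", "Actinobacteria")),  # identical sequence, different path
    ("TTTTTTTTTT", ("Bacteria", "Firmicutes")),
]
clusters = taxonomy_restricted_clusters(records, threshold=0.85)
for path, members in clusters:
    print(path, members)
```

In this toy input, the third record is identical in sequence to the first but carries a different taxonomic path, so it forms its own cluster; a purely similarity-based greedy method would have merged the two.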
Affiliation(s)
- Mohsen Pourjam
  - Core Facility Microbiome, ZIEL – Institute for Food & Health, Technical University Munich, Freising, Germany
- Klaus Neuhaus
  - Core Facility Microbiome, ZIEL – Institute for Food & Health, Technical University Munich, Freising, Germany
- Ilias Lagkouvardos
  - Core Facility Microbiome, ZIEL – Institute for Food & Health, Technical University Munich, Freising, Germany
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Greece
  - Correspondence: Ilias Lagkouvardos
|
22
|
Chiu JKH, Ong RTH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 2022; 23:108. [PMID: 35354426 PMCID: PMC8969259 DOI: 10.1186/s12859-022-04643-9] [Received: 03/26/2021] [Accepted: 03/02/2022]
Abstract
BACKGROUND Biological sequence clustering is a complicated data clustering problem owing to the high computation costs incurred for pairwise sequence distance calculations through sequence alignments, as well as difficulties in determining parameters for deriving robust clusters. While current approaches are successful in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. Poor choices of this identity threshold would thus lead to low-quality clusters. There is, however, little support provided to users in selecting thresholds that are well matched with the input sequences. RESULTS We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and graph community detection for cluster generation. Instead of a single threshold applied to every generated cluster, ALFATClust is capable of dynamically determining the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation for the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained. CONCLUSIONS ALFATClust is able to generate sequence clusters having high intra-cluster sequence similarity and substantial separation between clusters without requiring users to decide on precise similarity cut-off thresholds.
Affiliation(s)
- Jimmy Ka Ho Chiu
  - Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore
- Rick Twee-Hee Ong
  - Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore
|
23
|
Furuta Y, Miura F, Ichise T, Nakayama SMM, Ikenaka Y, Zorigt T, Tsujinouchi M, Ishizuka M, Ito T, Higashi H. A GCDGC-specific DNA (cytosine-5) methyltransferase that methylates the GCWGC sequence on both strands and the GCSGC sequence on one strand. PLoS One 2022; 17:e0265225. [PMID: 35312710 PMCID: PMC8936443 DOI: 10.1371/journal.pone.0265225] [Received: 12/24/2021] [Accepted: 02/24/2022]
Abstract
5-Methylcytosine is one of the major epigenetic marks of DNA in living organisms. Some bacterial species possess DNA methyltransferases that modify cytosines on both strands to produce fully-methylated sites or on either strand to produce hemi-methylated sites. In this study, we characterized a DNA methyltransferase that produces two sequences with different methylation patterns: one methylated on both strands and another on one strand. M.BatI is the orphan DNA methyltransferase of Bacillus anthracis coded in one of the prophages on the chromosome. Analysis of M.BatI modified DNA by bisulfite sequencing revealed that the enzyme methylates the first cytosine in sequences of 5ʹ-GCAGC-3ʹ, 5ʹ-GCTGC-3ʹ, and 5ʹ-GCGGC-3ʹ, but not of 5ʹ-GCCGC-3ʹ. This resulted in the production of fully-methylated 5ʹ-GCWGC-3ʹ and hemi-methylated 5ʹ-GCSGC-3ʹ. M.BatI also showed toxicity when expressed in E. coli, which was caused by a mechanism other than DNA modification activity. Homologs of M.BatI were found in other Bacillus species on different prophage like regions, suggesting the spread of the gene by several different phages. The discovery of the DNA methyltransferase with unique modification target specificity suggested unrevealed diversity of target sequences of bacterial cytosine DNA methyltransferase.
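The step from the three methylated pentamers to "fully-methylated GCWGC and hemi-methylated GCSGC" follows from reverse complementation, and a short sketch makes the reasoning explicit. The IUPAC codes (W = A/T, S = C/G, D = A/G/T) are standard; the set `METHYLATED` is taken from the abstract, while the function names are hypothetical helpers for illustration only.

```python
# Standard IUPAC degenerate codes used in the motif names.
IUPAC = {"W": "AT", "S": "CG", "D": "AGT"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Pentamers whose first cytosine is methylated, as reported in the abstract.
METHYLATED = {"GCAGC", "GCTGC", "GCGGC"}


def reverse_complement(seq):
    """Return the sequence read 5'->3' on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def expand(motif):
    """Expand a degenerate motif into every concrete sequence it matches."""
    results = [""]
    for base in motif:
        results = [prefix + b for prefix in results for b in IUPAC.get(base, base)]
    return results


def methylation_state(site):
    """Classify a site by how many of its two strands present a methylated pentamer."""
    hits = (site in METHYLATED) + (reverse_complement(site) in METHYLATED)
    return {0: "unmethylated", 1: "hemi-methylated", 2: "fully methylated"}[hits]


for motif in ("GCWGC", "GCSGC"):
    for site in expand(motif):
        print(motif, site, reverse_complement(site), methylation_state(site))
```

GCAGC and GCTGC are reverse complements of each other, so both strands of a GCWGC site carry a target. The reverse complement of GCGGC is GCCGC, which is not methylated, so a GCSGC site is modified on one strand only. The three methylated pentamers together are exactly the expansion of GCDGC, which matches the specificity named in the title.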
Affiliation(s)
- Yoshikazu Furuta
  - Division of Infection and Immunity, International Institute for Zoonosis Control, Hokkaido University, Sapporo, Japan
- Fumihito Miura
  - Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, Fukuoka, Japan
- Takahiro Ichise
  - Laboratory of Toxicology, Department of Environmental Veterinary Sciences, School of Veterinary Medicine, Hokkaido University, Sapporo, Japan
- Shouta M. M. Nakayama
  - Laboratory of Toxicology, Department of Environmental Veterinary Sciences, School of Veterinary Medicine, Hokkaido University, Sapporo, Japan
- Yoshinori Ikenaka
  - Laboratory of Toxicology, Department of Environmental Veterinary Sciences, School of Veterinary Medicine, Hokkaido University, Sapporo, Japan
  - Water Research Group, Unit for Environmental Sciences and Management, North-West University, Potchefstroom, South Africa
- Tuvshinzaya Zorigt
  - Division of Infection and Immunity, International Institute for Zoonosis Control, Hokkaido University, Sapporo, Japan
- Mai Tsujinouchi
  - Division of Infection and Immunity, International Institute for Zoonosis Control, Hokkaido University, Sapporo, Japan
- Mayumi Ishizuka
  - Laboratory of Toxicology, Department of Environmental Veterinary Sciences, School of Veterinary Medicine, Hokkaido University, Sapporo, Japan
- Takashi Ito
  - Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, Fukuoka, Japan
- Hideaki Higashi
  - Division of Infection and Immunity, International Institute for Zoonosis Control, Hokkaido University, Sapporo, Japan
|
24
|
Abstract
Modern sequencing technologies have provided insight into the genetic diversity of numerous species, including the human pathogen Pseudomonas aeruginosa. Bacterial genomes often harbor bacteriophage genomes (prophages), which can account for upwards of 20% of the genome. Prior studies have found P. aeruginosa prophages that contribute to their host’s pathogenicity and fitness. These advantages come in many different forms, including the production of toxins, promotion of biofilm formation, and displacement of other P. aeruginosa strains. While several different genera and species of P. aeruginosa prophages have been studied, there has not been a comprehensive study of the overall diversity of P. aeruginosa-infecting prophages. Here, we present the results of just such an analysis. A total of 6,852 high-confidence prophages were identified from 5,383 P. aeruginosa genomes from strains isolated from the human body and other environments. In total, 3,201 unique prophage sequences were identified. While 53.1% of these prophage sequences displayed sequence similarity to publicly available phage genomes, novel and highly mosaic prophages were discovered. Among these prophages, there is extensive diversity, including diversity within the functionally conserved integrase and C repressor coding regions, two genes responsible for prophage entering and persisting through the lysogenic life cycle. Analysis of integrase, C repressor, and terminase coding regions revealed extensive reassortment among P. aeruginosa prophages. This catalog of P. aeruginosa prophages provides a resource for future studies into the evolution of the species. IMPORTANCE Prophages play a critical role in the evolution of their host species and can also contribute to the virulence and fitness of pathogenic species. Here, we conducted a comprehensive investigation of prophage sequences from 5,383 publicly available Pseudomonas aeruginosa genomes from human as well as environmental isolates. 
We identified a diverse population of prophages, including tailed phages, inoviruses, and microviruses; 46.9% of the prophage sequences found share no significant sequence similarity with characterized phages, representing a vast array of novel P. aeruginosa-infecting phages. Our investigation into these prophages found substantial evidence of reassortment. In producing this, the first catalog of P. aeruginosa prophages, we uncovered both novel prophages as well as genetic content that have yet to be explored.
|
25
|
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Received: 05/03/2021] [Accepted: 12/06/2021]
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates “mimic” sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective, it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
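For readers unfamiliar with the input representation named here, a Frequency Chaos Game Representation at resolution k can be viewed as a 2^k × 2^k grid of k-mer counts. The sketch below is a minimal illustration under stated assumptions and is not the DeLUCS code: the corner layout in `CORNERS` is one arbitrary convention (published layouts differ), and the function name `fcgr` is hypothetical.

```python
# Corner assignment for the four nucleotides as (x_bit, y_bit).
# This layout is an assumption for illustration; published CGR conventions differ.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Count k-mers into a 2^k x 2^k grid (list of rows), one cell per distinct k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        # In the chaos game the most recent base selects the coarsest quadrant,
        # so the last base of the k-mer supplies the most significant bit.
        for base in reversed(kmer):
            bx, by = CORNERS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
    return grid


grid = fcgr("ACGTACGT", k=2)
for row in grid:
    print(row)
```

Each sequence, whatever its length, maps to a fixed-size grid, which is what allows sequences of differing lengths to be compared by the same downstream model.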
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
26
|
Cao M, Peng Q, Wei ZG, Liu F, Hou YF. EdClust: A heuristic sequence clustering method with higher sensitivity. J Bioinform Comput Biol 2021; 20:2150036. [PMID: 34939905 DOI: 10.1142/s0219720021500360]
Abstract
The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
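The general idea of greedy, seed-based clustering by alignment distance can be sketched as follows. This is an illustration only, not the edClust algorithm: a plain Levenshtein distance stands in for Edlib's semi-global alignment, and the names `edit_distance`, `greedy_cluster`, and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Assign each sequence to the first seed within max_dist, else start a new cluster."""
    clusters = []  # list of (seed, members) pairs
    # Longest sequences are visited first, so they become seeds.
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            if edit_distance(s, seed) <= max_dist:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

Because every sequence is compared only against seeds, the number of clusters depends on the visiting order and on `max_dist`, which is why heuristic tools of this kind can differ in the cluster counts they report.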
Affiliation(s)
- Ming Cao
- Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, P. R. China.,School of Mathematics and Statistics, Shaanxi Xueqian Normal University, Xi'an, 710100, P. R. China
| | - Qinke Peng
- Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, P. R. China
| | - Ze-Gang Wei
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| | - Fei Liu
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| | - Yi-Fan Hou
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, P. R. China
| |
|
27
|
Melnyk A, Mohebbi F, Knyazev S, Sahoo B, Hosseini R, Skums P, Zelikovsky A, Patterson M. From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering. J Comput Biol 2021; 28:1113-1129. [PMID: 34698508 DOI: 10.1089/cmb.2021.0302] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/24/2022]
Abstract
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus such as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
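The abstract does not define clustering entropy, so the sketch below shows only one plausible reading: the per-column Shannon entropy of each cluster's aligned sequences, summed over columns and averaged over clusters with weights proportional to cluster size. The function names and the weighting are assumptions and may differ from the authors' definition.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_entropy(aligned_seqs):
    """Sum of column entropies over equal-length aligned sequences."""
    return sum(column_entropy(col) for col in zip(*aligned_seqs))


def clustering_entropy(clusters):
    """Size-weighted mean of cluster entropies (assumed weighting)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)
```

Under this reading, a clustering whose clusters each contain near-identical sequences scores close to zero, which matches the abstract's use of lower entropy as the better outcome.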
Affiliation(s)
- Andrew Melnyk
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Fatemeh Mohebbi
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Bikram Sahoo
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Roya Hosseini
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA.,World-Class Research Center "Digital Biodesign and Personalized Healthcare," I.M. Sechenov First Moscow State Medical University, Moscow, Russia
| | - Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
| |
|
28
|
Analysis of SINE Families B2, Dip, and Ves with Special Reference to Polyadenylation Signals and Transcription Terminators. Int J Mol Sci 2021; 22:ijms22189897. [PMID: 34576060 PMCID: PMC8466645 DOI: 10.3390/ijms22189897] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/16/2021] [Revised: 09/05/2021] [Accepted: 09/06/2021] [Indexed: 01/09/2023]
Abstract
Short Interspersed Elements (SINEs) are eukaryotic non-autonomous retrotransposons transcribed by RNA polymerase III (pol III). The 3′-terminus of many mammalian SINEs has a polyadenylation signal (AATAAA), pol III transcription terminator, and A-rich tail. The RNAs of such SINEs can be polyadenylated, which is unique for pol III transcripts. Here, B2 (mice and related rodents), Dip (jerboas), and Ves (vespertilionid bats) SINE families were thoroughly studied. They were divided into subfamilies reliably distinguished by relatively long indels. The age of SINE subfamilies can be estimated, which allows us to reconstruct their evolution. The youngest and most active variants of SINE subfamilies were given special attention. The shortest pol III transcription terminators are TCTTT (B2), TATTT (Ves and Dip), and the rarer TTTT. The last nucleotide of the terminator is often not transcribed; accordingly, the truncated terminator of its descendant becomes nonfunctional. The incidence of complete transcription of the TCTTT terminator is twice that of TTTT, and thus functional terminators are more likely to be preserved in daughter SINE copies. Young copies have long poly(A) tails; however, these gradually shorten over host generations. Unexpectedly, tail shortening below A10 increases the incidence of terminator elongation by Ts, thus restoring its efficiency. This process can be critical for the maintenance of SINE activity in the genome.
|
29
|
Patin NV, Dietrich ZA, Stancil A, Quinan M, Beckler JS, Hall ER, Culter J, Smith CG, Taillefert M, Stewart FJ. Gulf of Mexico blue hole harbors high levels of novel microbial lineages. THE ISME JOURNAL 2021; 15:2206-2232. [PMID: 33612832 PMCID: PMC8319197 DOI: 10.1038/s41396-021-00917-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/01/2020] [Revised: 01/14/2021] [Accepted: 01/27/2021] [Indexed: 01/31/2023]
Abstract
Exploration of oxygen-depleted marine environments has consistently revealed novel microbial taxa and metabolic capabilities that expand our understanding of microbial evolution and ecology. Marine blue holes are shallow karst formations characterized by low oxygen and high organic matter content. They are logistically challenging to sample, and thus our understanding of their biogeochemistry and microbial ecology is limited. We present a metagenomic and geochemical characterization of Amberjack Hole on the Florida continental shelf (Gulf of Mexico). Dissolved oxygen became depleted at the hole's rim (32 m water depth), remained low but detectable in an intermediate hypoxic zone (40-75 m), and then increased to a secondary peak before falling below detection in the bottom layer (80-110 m), concomitant with increases in nutrients, dissolved iron, and a series of sequentially more reduced sulfur species. Microbial communities in the bottom layer contained heretofore undocumented levels of the recently discovered phylum Woesearchaeota (up to 58% of the community), along with lineages in the bacterial Candidate Phyla Radiation (CPR). Thirty-one high-quality metagenome-assembled genomes (MAGs) showed extensive biochemical capabilities for sulfur and nitrogen cycling, as well as for resisting and respiring arsenic. One uncharacterized gene associated with a CPR lineage differentiated hypoxic from anoxic zone communities. Overall, microbial communities and geochemical profiles were stable across two sampling dates in the spring and fall of 2019. The blue hole habitat is a natural marine laboratory that provides opportunities for sampling taxa with under-characterized but potentially important roles in redox-stratified microbial processes.
Affiliation(s)
- N V Patin
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.
- Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA.
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, FL, USA.
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL, USA.
- Stationed at Southwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, La Jolla, CA, USA.
| | | | - A Stancil
- Harbor Branch Oceanographic Institute, Florida Atlantic University, Ft. Pierce, FL, USA
| | - M Quinan
- Harbor Branch Oceanographic Institute, Florida Atlantic University, Ft. Pierce, FL, USA
| | - J S Beckler
- Harbor Branch Oceanographic Institute, Florida Atlantic University, Ft. Pierce, FL, USA
| | - E R Hall
- Mote Marine Laboratory, Sarasota, FL, USA
| | - J Culter
- Mote Marine Laboratory, Sarasota, FL, USA
| | - C G Smith
- U.S. Geological Survey, St. Petersburg Coastal and Marine Science Center, St. Petersburg, FL, USA
| | - M Taillefert
- School of Earth & Atmospheric Sciences, Georgia Institute of Technology, Atlanta, GA, USA
| | - F J Stewart
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Microbiology & Immunology, Montana State University, Bozeman, MT, USA
| |
|
30
|
Determination of k-mer density in a DNA sequence and subsequent cluster formation algorithm based on the application of electronic filter. Sci Rep 2021; 11:13701. [PMID: 34211040 PMCID: PMC8249421 DOI: 10.1038/s41598-021-93154-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Accepted: 06/07/2021] [Indexed: 02/06/2023]
Abstract
We describe a novel algorithm for information recovery from DNA sequences by using a digital filter. This work proposes a three-part algorithm to determine the k-mer or q-gram word density. Employing a finite impulse response digital filter, one can calculate the sequence's k-mer or q-gram word density. Further, principal component analysis is applied to the word density distribution to analyze the dissimilarity between sequences. A dissimilarity matrix is thus formed and shows the appearance of cluster formation. This cluster formation is constructed based on the alignment-free sequence method. Furthermore, the clusters are used to build phylogenetic relations. The cluster algorithm is in good agreement with alignment-based algorithms. The present algorithm is simple and requires less time for computation than other currently available algorithms. We tested the algorithm using beta hemoglobin coding sequences (HBB) of 10 different species and 18 primate mitochondrial genome (mtDNA) sequences.
|
31
|
Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform 2021; 3:lqab001. [PMID: 33554117 PMCID: PMC7850047 DOI: 10.1093/nargab/lqab001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/14/2019] [Revised: 12/07/2020] [Accepted: 01/08/2021] [Indexed: 11/12/2022]
Abstract
Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic and therefore slow, especially on long sequences. In many applications that involve large sequence datasets, all that is needed is calculating the identity scores (percentage of identical nucleotides in an optimal alignment, including gaps, of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2-80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.
Affiliation(s)
- Hani Z Girgis
- Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, 700 University Boulevard, Kingsville, TX 78363, USA
| | - Benjamin T James
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA
| | - Brian B Luczak
- Department of Mathematics, Vanderbilt University, 1326 Stevenson Center Lane, Nashville, TN 3721, USA
| |
|
32
|
Blokh D, Gitarts J, Stambler I. An information-theoretical analysis of gene nucleotide sequence structuredness for a selection of aging and cancer-related genes. Genomics Inform 2020; 18:e41. [PMID: 33412757 PMCID: PMC7808870 DOI: 10.5808/gi.2020.18.4.e41] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2020] [Accepted: 11/27/2020] [Indexed: 12/02/2022]
Abstract
We provide an algorithm for the construction and analysis of autocorrelation (information) functions of gene nucleotide sequences. As a measure of correlation between discrete random variables, we use normalized mutual information. The information functions are indicative of the degree of structuredness of gene sequences. We construct the information functions for selected gene sequences. We find a significant difference between information functions of genes of different types. We hypothesize that the features of information functions of gene nucleotide sequences are related to phenotypes of these genes.
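An autocorrelation-style information function of the kind described here can be sketched by computing the mutual information between nucleotides separated by a lag d, for a range of lags. The sketch below is an assumption-laden illustration: the abstract does not say which normalization is used, so dividing by the smaller of the two marginal entropies is a choice made here, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (bits) of a Counter whose values sum to n."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_mi(seq, lag):
    """Mutual information between seq[i] and seq[i + lag], scaled to [0, 1]."""
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    if n == 0:
        raise ValueError("lag must be shorter than the sequence")
    hx = entropy(Counter(a for a, _ in pairs), n)
    hy = entropy(Counter(b for _, b in pairs), n)
    hxy = entropy(Counter(pairs), n)
    mi = hx + hy - hxy
    denom = min(hx, hy)  # assumed normalization; other choices exist
    return mi / denom if denom > 0 else 0.0


def information_function(seq, max_lag):
    """Normalized mutual information for lags 1..max_lag."""
    return [normalized_mi(seq, d) for d in range(1, max_lag + 1)]
```

A strictly periodic sequence such as "ACACACAC" gives a value of 1 at lag 1, while a sequence with no dependence between positions tends toward 0 for long inputs.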
Affiliation(s)
- David Blokh
- C.D. Technologies Ltd., Beer Sheba 8445914, Israel
| | - Joseph Gitarts
- Efi Arazi School of Computer Science, Interdisciplinary Center, Herzliya 4673304, Israel
| | - Ilia Stambler
- Department of Science, Technology and Society, Bar Ilan University, Ramat Gan 5290002, Israel
- Corresponding author: E-mail:
| |
|
33
|
Patin NV, Peña-Gonzalez A, Hatt JK, Moe C, Kirby A, Konstantinidis KT. The Role of the Gut Microbiome in Resisting Norovirus Infection as Revealed by a Human Challenge Study. mBio 2020; 11:e02634-20. [PMID: 33203758 PMCID: PMC7683401 DOI: 10.1128/mbio.02634-20] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/22/2020] [Accepted: 10/16/2020] [Indexed: 12/11/2022]
Abstract
Norovirus infections take a heavy toll on worldwide public health. While progress has been made toward understanding host responses to infection, the role of the gut microbiome in determining infection outcome is unknown. Moreover, data are lacking on the nature and duration of the microbiome response to norovirus infection, which has important implications for diagnostics and host recovery. Here, we characterized the gut microbiomes of subjects enrolled in a norovirus challenge study. We analyzed microbiome features of asymptomatic and symptomatic individuals at the genome (population) and gene levels and assessed their response over time in symptomatic individuals. We show that the preinfection microbiomes of subjects with asymptomatic infections were enriched in Bacteroidetes and depleted in Clostridia relative to the microbiomes of symptomatic subjects. These compositional differences were accompanied by differences in genes involved in the metabolism of glycans and sphingolipids that may aid in host resilience to infection. We further show that microbiomes shifted in composition following infection and that recovery times were variable among human hosts. In particular, Firmicutes increased immediately following the challenge, while Bacteroidetes and Proteobacteria decreased over the same time. Genes enriched in the microbiomes of symptomatic subjects, including the adenylyltransferase glgC, were linked to glycan metabolism and cell-cell signaling, suggesting as-yet unknown roles for these processes in determining infection outcome. These results provide important context for understanding the gut microbiome role in host susceptibility to symptomatic norovirus infection and long-term health outcomes. IMPORTANCE The role of the human gut microbiome in determining whether an individual infected with norovirus will be symptomatic is poorly understood. This study provides important data on microbes that distinguish asymptomatic from symptomatic microbiomes and links these features to infection responses in a human challenge study. The results have implications for understanding resistance to and treatment of norovirus infections.
Affiliation(s)
- N V Patin
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - A Peña-Gonzalez
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
- Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
| | - J K Hatt
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - C Moe
- Rollins School of Public Health, Emory University, Atlanta, Georgia, USA
| | - A Kirby
- Waterborne Disease Prevention Branch, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
| | - K T Konstantinidis
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
|
34
|
Paul T, Vainio S, Roning J. Clustering and classification of virus sequence through music communication protocol and wavelet transform. Genomics 2020; 113:778-784. [PMID: 33069829 PMCID: PMC7561519 DOI: 10.1016/j.ygeno.2020.10.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/02/2020] [Accepted: 10/13/2020] [Indexed: 01/19/2023]
Abstract
The coronavirus pandemic became a major risk in global public health. The outbreak is caused by SARS-CoV-2, a member of the coronavirus family. Though the images of the virus are familiar to us, in the present study, an attempt is made to hear the coronavirus by translating its protein spike into audio sequences. The musical features such as pitch, timbre, volume and duration are mapped based on the coronavirus protein sequence. Three different viruses Influenza, Ebola and Coronavirus were studied and compared through their auditory virus sequences by implementing Haar wavelet transform. The sonification of the coronavirus benefits in understanding the protein structures by enhancing the hidden features. Further, it makes a clear difference in the representation of coronavirus compared with other viruses, which will help in various research works related to virus sequence. This evolves as a simplified and novel way of representing the conventional computational methods.
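For readers unfamiliar with the Haar wavelet transform mentioned in this abstract, one decomposition level can be sketched as below. The input is taken to be any numeric signal of even length; how the authors map protein residues to pitch, timbre, volume, or duration is not described in enough detail to reproduce, so no such mapping is assumed, and `haar_step` is a hypothetical name.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2)
    # Pairwise scaled sums capture the coarse trend of the signal.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # Pairwise scaled differences capture local changes.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail
```

Applying `haar_step` again to the approximation coefficients yields the next, coarser level; the detail coefficients at each level are the features usually compared between signals.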
Affiliation(s)
- Tirthankar Paul
- InfoTech Oulu, Biomimetics and Intelligent Systems Group (BISG), Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland.
| | - Seppo Vainio
- InfoTech Oulu, Faculty of Biochemistry and Molecular Medicine, Biocenter Oulu, Laboratory of Development Biology, University of Oulu, Oulu, Finland.
| | - Juha Roning
- InfoTech Oulu, Biomimetics and Intelligent Systems Group (BISG), Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland.
| |
|
35
|
Review of Hepatitis E Virus in Rats: Evident Risk of Species Orthohepevirus C to Human Zoonotic Infection and Disease. Viruses 2020; 12:v12101148. [PMID: 33050353 PMCID: PMC7600399 DOI: 10.3390/v12101148] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 08/13/2020] [Revised: 09/29/2020] [Accepted: 10/07/2020] [Indexed: 12/13/2022]
Abstract
Hepatitis E virus (HEV) (family Hepeviridae) is one of the most common human pathogens, causing acute hepatitis, and is an increasingly recognized etiological agent in chronic hepatitis and extrahepatic manifestations. Recent studies reported that not only are the classical members of the species Orthohepevirus A (HEV-A) pathogenic to humans, but a genetically highly divergent rat-origin hepevirus (HEV-C1) in species Orthohepevirus C (HEV-C) is also able to cause zoonotic infection and symptomatic disease (hepatitis) in humans. This review summarizes the current knowledge of hepeviruses in rodents, with a special focus on rat-origin HEV-C1. Cross-species transmission and genetic diversity of HEV-C1, together with confirmation of HEV-C1 infections and symptomatic disease in humans, have re-opened the long-running and surprise-filled story of HEV in humans. This novel knowledge has consequences for the epidemiology, clinical aspects, laboratory diagnosis, and prevention of HEV infection in humans.
|
36
|
Abrouk M, Ahmed HI, Cubry P, Šimoníková D, Cauet S, Pailles Y, Bettgenhaeuser J, Gapa L, Scarcelli N, Couderc M, Zekraoui L, Kathiresan N, Čížková J, Hřibová E, Doležel J, Arribat S, Bergès H, Wieringa JJ, Gueye M, Kane NA, Leclerc C, Causse S, Vancoppenolle S, Billot C, Wicker T, Vigouroux Y, Barnaud A, Krattinger SG. Fonio millet genome unlocks African orphan crop diversity for agriculture in a changing climate. Nat Commun 2020; 11:4488. [PMID: 32901040 PMCID: PMC7479619 DOI: 10.1038/s41467-020-18329-4] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 04/13/2020] [Accepted: 08/16/2020] [Indexed: 01/24/2023]
Abstract
Sustainable food production in the context of climate change necessitates diversification of agriculture and a more efficient utilization of plant genetic resources. Fonio millet (Digitaria exilis) is an orphan African cereal crop with a great potential for dryland agriculture. Here, we establish high-quality genomic resources to facilitate fonio improvement through molecular breeding. These include a chromosome-scale reference assembly and deep re-sequencing of 183 cultivated and wild Digitaria accessions, enabling insights into genetic diversity, population structure, and domestication. Fonio diversity is shaped by climatic, geographic, and ethnolinguistic factors. Two genes associated with seed size and shattering showed signatures of selection. Most known domestication genes from other cereal models however have not experienced strong selection in fonio, providing direct targets to rapidly improve this crop for agriculture in hot and dry environments.
Affiliation(s)
- Michael Abrouk
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Hanin Ibrahim Ahmed
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Denisa Šimoníková
- Institute of Experimental Botany of the Czech Academy of Sciences, Centre of the Region Hana for Biotechnological and Agricultural Research, Olomouc, Czech Republic
- Yveline Pailles
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Jan Bettgenhaeuser
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Liubov Gapa
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Nagarajan Kathiresan
- Supercomputing Core Lab, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Jana Čížková
- Institute of Experimental Botany of the Czech Academy of Sciences, Centre of the Region Hana for Biotechnological and Agricultural Research, Olomouc, Czech Republic
- Eva Hřibová
- Institute of Experimental Botany of the Czech Academy of Sciences, Centre of the Region Hana for Biotechnological and Agricultural Research, Olomouc, Czech Republic
- Jaroslav Doležel
- Institute of Experimental Botany of the Czech Academy of Sciences, Centre of the Region Hana for Biotechnological and Agricultural Research, Olomouc, Czech Republic
- Hélène Bergès
- CNRGV Plant Genomics Center, INRAE, Toulouse, France
- Inari Agriculture, One Kendall Square Building 600/700, Cambridge, MA, 02139, USA
- Mathieu Gueye
- Laboratoire de Botanique, Département de Botanique et Géologie, IFAN Ch. A. Diop/UCAD, Dakar, Senegal
- Ndjido A Kane
- Senegalese Agricultural Research Institute, Dakar, Senegal
- Laboratoire Mixte International LAPSE, Dakar, Senegal
- Christian Leclerc
- CIRAD, UMR AGAP, Montpellier, France
- AGAP, Université de Montpellier, Cirad, INRAE, Institut Agro, Montpellier, France
- Sandrine Causse
- CIRAD, UMR AGAP, Montpellier, France
- AGAP, Université de Montpellier, Cirad, INRAE, Institut Agro, Montpellier, France
- Sylvie Vancoppenolle
- CIRAD, UMR AGAP, Montpellier, France
- AGAP, Université de Montpellier, Cirad, INRAE, Institut Agro, Montpellier, France
- Claire Billot
- CIRAD, UMR AGAP, Montpellier, France
- AGAP, Université de Montpellier, Cirad, INRAE, Institut Agro, Montpellier, France
- Thomas Wicker
- Department of Plant and Microbial Biology, University of Zurich, Zürich, Switzerland
- Adeline Barnaud
- DIADE, Univ Montpellier, IRD, Montpellier, France
- Laboratoire Mixte International LAPSE, Dakar, Senegal
- Simon G Krattinger
- Center for Desert Agriculture, Biological and Environmental Science & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
38
Jiang P, Luo J, Wang Y, Deng P, Schmidt B, Tang X, Chen N, Wong L, Zhao L. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2020; 35:4871-4878. [PMID: 31038666 DOI: 10.1093/bioinformatics/btz299] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/29/2018] [Revised: 04/02/2019] [Accepted: 04/19/2019] [Indexed: 12/25/2022]
Abstract
MOTIVATION K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. RESULTS We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. AVAILABILITY AND IMPLEMENTATION The source code of our algorithm is available at github.com/lzhLab/kmcEx. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
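The general idea of pairing a Bloom filter-style membership array with a second array for frequencies can be illustrated with a small sketch. This is not the kmcEx encoding, whose details are not given in the abstract; the class name `CountedKmerFilter`, the use of salted BLAKE2b hashes, and the max-on-insert, min-on-query rule are all assumptions chosen for illustration only.

```python
import hashlib


class CountedKmerFilter:
    """Toy Bloom-filter-like store for counted k-mers (illustrative assumption,
    not the kmcEx data structure)."""

    def __init__(self, num_slots=1 << 16, num_hashes=7):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        # Array 1: membership flags (one byte per slot, for simplicity).
        self.present = bytearray(num_slots)
        # Array 2: frequency stored at the same slot positions.
        self.counts = [0] * num_slots

    def _slots(self, kmer):
        # Derive num_hashes slot indices from salted BLAKE2b digests.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=i.to_bytes(2, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_slots

    def add(self, kmer, count):
        # Mark every slot and keep the largest count seen at that slot.
        for s in self._slots(kmer):
            self.present[s] = 1
            self.counts[s] = max(self.counts[s], count)

    def query(self, kmer):
        # Returns 0 if any slot is unset (k-mer definitely absent);
        # otherwise an upper bound on the stored count.
        slots = list(self._slots(kmer))
        if not all(self.present[s] for s in slots):
            return 0
        return min(self.counts[s] for s in slots)
```

In this sketch a lookup touches a fixed number of slots, so retrieval time does not depend on how many k-mers are stored. Collisions can only inflate a reported count or produce a false positive, never hide a stored k-mer.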
Affiliation(s)
- Peng Jiang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Jie Luo
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Yiqi Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Pingji Deng
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Xiangjun Tang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- Ningjiang Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
- Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
- Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
- School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
39
Borredá C, Pérez-Román E, Ibanez V, Terol J, Talon M. Reprogramming of Retrotransposon Activity during Speciation of the Genus Citrus. Genome Biol Evol 2020; 11:3478-3495. [PMID: 31710678 PMCID: PMC7145672 DOI: 10.1093/gbe/evz246] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 11/04/2019] [Indexed: 12/13/2022]
Abstract
Speciation of the genus Citrus from a common ancestor has recently been established to begin ∼8 Ma during the late Miocene, a period of major climatic alterations. Here, we report the changes in activity of Citrus LTR retrotransposons during the process of diversification that gave rise to the current Citrus species. To reach this goal, we analyzed four pure species that diverged early during Citrus speciation, three recent admixtures derived from those species and an outgroup of the Citrus clade. More than 30,000 retrotransposons were grouped in ten lineages. Estimations of LTR insertion times revealed that retrotransposon activity followed a species-specific pattern of change that could be ascribed to one of three different models. In some genomes, the expected pattern of gradual transposon accumulation was suddenly arrested during the radiation of the ancestor that gave birth to the current Citrus species. The individualized analyses of retrotransposon lineages showed that in each and every species studied, not all lineages follow the general pattern of the species itself. For instance, in most of the genomes, the retrotransposon activity of elements from the SIRE lineage reached its highest level just before Citrus speciation, while for Retrofit elements, it has been steadily growing. Based on these observations, we propose that Citrus retrotransposons may respond to stressful conditions driving speciation as a part of the genetic response involved in adaptation. This proposal implies that the evolving conditions of each species interact with the internal regulatory mechanisms of the genome controlling the proliferation of mobile elements.
Affiliation(s)
- Carles Borredá
- Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias (IVIA), Valencia, Spain
- Estela Pérez-Román
- Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias (IVIA), Valencia, Spain
- Victoria Ibanez
- Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias (IVIA), Valencia, Spain
- Javier Terol
- Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias (IVIA), Valencia, Spain
- Manuel Talon
- Centro de Genómica, Instituto Valenciano de Investigaciones Agrarias (IVIA), Valencia, Spain
40
Sahlin K, Medvedev P. De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm. J Comput Biol 2020; 27:472-484. [PMID: 32181688 DOI: 10.1089/cmb.2019.0299] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 12/25/2022]
Abstract
Long-read sequencing of transcripts with Pacific Biosciences (PacBio) Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (to scale) and makes use of quality values (to handle variable error rates). We test isONclust on three simulated and five biological data sets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy, scalability to large data sets, or both.
Affiliation(s)
- Kristoffer Sahlin
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, Pennsylvania
- Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, Pennsylvania
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania
41
Valencia JD, Girgis HZ. LtrDetector: A tool-suite for detecting long terminal repeat retrotransposons de-novo. BMC Genomics 2019; 20:450. [PMID: 31159720 PMCID: PMC6547461 DOI: 10.1186/s12864-019-5796-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/26/2018] [Accepted: 05/14/2019] [Indexed: 12/19/2022]
Abstract
BACKGROUND Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and defense mechanisms. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None can be executed in parallel out of the box and very few have features to support visual review of new elements. To overcome these limitations, we developed LtrDetector, which uses techniques inspired by signal-processing. RESULTS We compared LtrDetector to LTR_Finder and LTRharvest, the two most successful predecessor tools, on six plant genomes. For each organism, we constructed a ground truth data set based on queries from a consensus sequence database. According to this evaluation, LtrDetector was the most sensitive tool, achieving 16-23% improvement in sensitivity over LTRharvest and 21% improvement over LTR_Finder. All three tools had low false positive rates, with LtrDetector achieving 98.2% precision, in between its two competitors. Overall, LtrDetector provides the best compromise between high sensitivity and low false positive rate while requiring moderate time and utilizing memory available on personal computers. CONCLUSIONS LtrDetector uses a novel methodology revolving around k-mer distributions, which allows it to produce high-quality results using relatively lightweight procedures. It is easy to install and use. It is not species specific, performing well using its default parameters on genomes of varying size and repeat content. It is automatically configured for parallel execution and runs efficiently on an ordinary personal computer. 
It includes a k-mer scores visualization tool to facilitate manual review of the identified elements. These features make LtrDetector an attractive tool for future annotation projects involving long terminal repeat retrotransposons.
Affiliation(s)
- Joseph D Valencia
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA
- Hani Z Girgis
- The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, 74104, OK, USA
42
Tight clustering for large datasets with an application to gene expression data. Sci Rep 2019; 9:3053. [PMID: 30816195 PMCID: PMC6395712 DOI: 10.1038/s41598-019-39459-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/08/2018] [Accepted: 01/25/2019] [Indexed: 11/24/2022]
Abstract
This article proposes a practical and scalable version of the tight clustering algorithm. The tight clustering algorithm provides tight and stable relevant clusters as output while leaving a set of points as noise or scattered points that would not go into any cluster. However, the computational limitation to achieve this precise target of tight clusters prohibits it from being used for large microarray gene expression data or any other large data set, which are common nowadays. We propose a pragmatic and scalable version of the tight clustering method that is applicable to data sets of very large size and deduce the properties of the proposed algorithm. We validate our algorithm with an extensive simulation study and multiple real data analyses, including analysis of real gene expression data.