1
|
Xie J, Liu X, Qin Z, Mei S, Tarafder E, Li C, Zeng X, Tian F. Evolution and related pathogenic genes of Pseudodiploöspora longispora on Morchella based on genomic characterization and comparative genomic analysis. Sci Rep 2024; 14:18588. [PMID: 39127740 PMCID: PMC11316761 DOI: 10.1038/s41598-024-69421-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Accepted: 08/05/2024] [Indexed: 08/12/2024] Open
Abstract
True morels (Morchella) are globally renowned medicinal and edible mushrooms. White mold disease caused by fungi is the main disease of Morchella, which has the characteristics of wide incidence and strong destructiveness. The disparities observed in the isolation rates of different pathogens indicate their varying degrees of host adaptability and competitive survival abilities. In order to elucidate its potential mechanism, this study, the pathogen of white mold disease from Dafang county, Guizhou Province was isolated and purified, identified as Pseudodiploöspora longispora by morphological, molecular biological and pathogenicity tests. Furthermore, high-quality genome of P. longisporus (40.846 Mb) was assembled N50 of 3.09 Mb, predicts 7381 protein-coding genes. Phylogenetic analysis of single-copy homologous genes showed that P. longispora and Zelopaecilomyces penicillatus have the closest evolutionary relationship, diverging into two branches approximately 50 (44.3-61.4) MYA. Additionally, compared with the other two pathogens causing Morchella disease, Z. penicillatus and Cladobotryum protrusum, it was found that they had similar proportions of carbohydrate enzyme types and encoded abundant cell wall degrading enzymes, such as chitinase and glucanase, indicating their important role in disease development. Moreover, the secondary metabolite gene clusters of P. longispora and Z. penicillatus show a high degree of similarity to leucinostatin A and leucinostatin B (peptaibols). Furthermore, a gene cluster with synthetic toxic substance Ochratoxin A was also identified in P. longispora and C. protrusum, indicating that they may pose a potential threat to food safety. This study provides valuable insights into the genome of P. longispora, contributing to pathogenicity research.
Collapse
Affiliation(s)
- Jiangtao Xie
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Xue Liu
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Zaili Qin
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Shihui Mei
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Entaj Tarafder
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
| | - Chao Li
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Xiangyu Zeng
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China
- Institute of Edible Mushroom, Guizhou University, Guiyang, China
| | - Fenghua Tian
- Department of Plant Pathology, College of Agriculture, Guizhou University, Guiyang, Guizhou, China.
- Engineering Research Center of Chinese Ministry of Education for Edible and Medicinal Fungi, Jilin Agricultural University, Changchun, Jilin, China.
- Guizhou Key Laboratory of Edible Fungi Breeding, Guiyang, China.
- Institute of Edible Mushroom, Guizhou University, Guiyang, China.
- Tianiin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.
| |
Collapse
|
2
|
Hoang M, Marçais G, Kingsford C. Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. J Comput Biol 2024; 31:2-20. [PMID: 37975802 PMCID: PMC10794853 DOI: 10.1089/cmb.2023.0212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2023] Open
Abstract
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to be simultaneously achieved. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
Collapse
Affiliation(s)
- Minh Hoang
- Department of Computer Science, and Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Guillaume Marçais
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
3
|
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
| | - Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
4
|
Pellow D, Pu L, Ekim B, Kotlar L, Berger B, Shamir R, Orenstein Y. Efficient minimizer orders for large values of k using minimum decycling sets. Genome Res 2023; 33:1154-1161. [PMID: 37558282 PMCID: PMC10538483 DOI: 10.1101/gr.277644.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2023] [Accepted: 04/20/2023] [Indexed: 08/11/2023]
Abstract
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
Collapse
Affiliation(s)
- David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
| | - Lianrong Pu
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
| | - Bariş Ekim
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Lior Kotlar
- Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel;
| | - Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel;
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
| |
Collapse
|
5
|
Chicco D, Ferraro Petrillo U, Cattaneo G. Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment. PLoS Comput Biol 2023; 19:e1011272. [PMID: 37471333 PMCID: PMC10358940 DOI: 10.1371/journal.pcbi.1011272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/22/2023] Open
Abstract
Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.
Collapse
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | | | - Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano (Salerno), Italy
| |
Collapse
|
6
|
Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. Bioinformatics 2023; 39:btad057. [PMID: 36702468 PMCID: PMC9907223 DOI: 10.1093/bioinformatics/btad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2022] [Accepted: 01/24/2023] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Martin C Frith
- Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan
| | - Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
| | - John L Spouge
- National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|
7
|
Manconi A, Gnocchi M, Milanesi L, Marullo O, Armano G. Framing Apache Spark in life sciences. Heliyon 2023; 9:e13368. [PMID: 36852030 PMCID: PMC9958288 DOI: 10.1016/j.heliyon.2023.e13368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Revised: 01/19/2023] [Accepted: 01/29/2023] [Indexed: 02/11/2023] Open
Abstract
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of tasks requires distributed computing systems and algorithms able to ensure efficient processing. Cutting edge distributed programming frameworks allow to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Also thanks to specialised libraries for working with structured and relational data, it allows to support machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities.
Collapse
Affiliation(s)
- Andrea Manconi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Matteo Gnocchi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Luciano Milanesi
- Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
| | - Osvaldo Marullo
- Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy
| | - Giuliano Armano
- Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy
| |
Collapse
|
8
|
Mao J, Wang Y, Wang B, Li J, Zhang C, Zhang W, Li X, Li J, Zhang J, Li H, Zhang Z. High-quality haplotype-resolved genome assembly of cultivated octoploid strawberry. HORTICULTURE RESEARCH 2023; 10:uhad002. [PMID: 37077373 PMCID: PMC10108017 DOI: 10.1093/hr/uhad002] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Accepted: 01/03/2023] [Indexed: 05/03/2023]
Abstract
Cultivated strawberry (Fragaria × ananassa), a perennial herb belonging to the family Rosaceae, is a complex octoploid with high heterozygosity at most loci. However, there is no research on the haplotype of the octoploid strawberry genome. Here we aimed to obtain a high-quality genome of the cultivated strawberry cultivar, "Yanli", using single molecule real-time sequencing and high-throughput chromosome conformation capture technology. The "Yanli" genome was 823 Mb in size, with a long terminal repeat assembly index of 14.99. The genome was phased into two haplotypes, Hap1 (825 Mb with contig N50 of 26.70 Mb) and Hap2 (808 Mb with contig N50 of 27.51 Mb). Using the combination of Hap1 and Hap2, we obtained for the first time a haplotype-resolved genome with 56 chromosomes for the cultivated octoploid strawberry. We identified a ~ 10 Mb inversion and translocation on chromosome 2-1. 104 957 and 102 356 protein-coding genes were annotated in Hap1 and Hap2, respectively. Analysis of the genes related to the anthocyanin biosynthesis pathway revealed the structural diversity and complexity in the expression of the alleles in the octoploid F. × ananassa genome. In summary, we obtained a high-quality haplotype-resolved genome assembly of F. × ananassa, which will provide the foundation for investigating gene function and evolution of the genome of cultivated octoploid strawberry.
Collapse
Affiliation(s)
| | | | - Baotian Wang
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
| | - Jiqi Li
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
| | - Chao Zhang
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
| | - Wenshuo Zhang
- School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Shanghai 201210, China
| | - Xue Li
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
| | - Jie Li
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
| | - Junxiang Zhang
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
- Laboratory of Protected Horticulture (Shenyang Agricultural University), Ministry of Education, Shenyang 110866, China
| | - He Li
- Liaoning Key Laboratory of Strawberry Breeding and Cultivation, College of Horticulture, Shenyang Agricultural University, 120 Dongling Road, Shenyang 110866, China
- Laboratory of Protected Horticulture (Shenyang Agricultural University), Ministry of Education, Shenyang 110866, China
| | | |
Collapse
|
9
|
Flomin D, Pellow D, Shamir R. Data Set-Adaptive Minimizer Order Reduces Memory Usage in k-Mer Counting. J Comput Biol 2022; 29:825-838. [DOI: 10.1089/cmb.2021.0599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Dan Flomin
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| |
Collapse
|
10
|
Abstract
Motivation Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. Results We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. Availability and implementation A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Guillaume Marçais
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|