1
Prioritized candidate causal haplotype blocks in plant genome-wide association studies. PLoS Genet 2022; 18:e1010437. [PMID: 36251695 PMCID: PMC9612827 DOI: 10.1371/journal.pgen.1010437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2021] [Revised: 10/27/2022] [Accepted: 09/20/2022] [Indexed: 11/05/2022] Open
Abstract
Genome-wide association studies (GWAS) can play an essential role in understanding the genetic basis of complex traits in plants and animals. Conventional SNP-based linear mixed models (LMM) that marginally test single nucleotide polymorphisms (SNPs) have successfully identified many loci with major and minor effects in many GWAS. In plants, the relatively small population size in GWAS and the high genetic diversity found in many plant species can impede mapping efforts on complex traits. Here we present a novel haplotype-based trait fine-mapping framework, HapFM, to supplement current GWAS methods. HapFM uses genotype data to partition the genome into haplotype blocks, identifies haplotype clusters within each block, and then performs genome-wide haplotype fine-mapping to prioritize the candidate causal haplotype blocks for a trait. We benchmarked HapFM, GEMMA, BSLMM, GMMAT, and BLINK in both simulated and real plant GWAS datasets. HapFM consistently resulted in higher mapping power than the other GWAS methods in the high-polygenicity simulation setting. Moreover, it resulted in smaller mapping intervals, especially in regions of high LD, achieved by prioritizing small candidate causal blocks in the larger haplotype blocks. In the Arabidopsis flowering time (FT10) datasets, HapFM identified four novel loci compared to GEMMA’s results, and the average mapping interval of HapFM was 9.6 times smaller than that of GEMMA. In conclusion, HapFM is tailored for plant GWAS, giving high mapping power on complex traits and improved mapping resolution to facilitate crop improvement.
Genome-wide association studies (GWAS) are commonly used in human and plant studies to identify genetic variants responsible for the phenotype of interest and provide foundations for studying disease mechanisms and crop improvement. Most GWAS models are developed and optimized using human datasets. However, the differences between human and plant datasets limit their application in plant studies, especially when mapping complex traits such as drought resistance and yield. In this study, we present a novel GWAS method, HapFM, tailored for plant datasets to overcome the difficulties of many conventional GWAS methods. HapFM resulted in higher statistical power than conventional GWAS methods for mapping complex traits in our simulation and real dataset analyses. In addition, HapFM reduced the mapping interval by prioritizing candidate causal regions in the genome, which benefits downstream experimental studies. Finally, HapFM can incorporate biological annotations to increase statistical power further. Overall, HapFM balances statistical power, result interpretability, and downstream experimental verifiability.
2
Couvin D, Segretier W, Stattner E, Rastogi N. Novel methods included in SpolLineages tool for fast and precise prediction of Mycobacterium tuberculosis complex spoligotype families. Database (Oxford) 2020; 2020:baaa108. [PMID: 33320180 PMCID: PMC7737520 DOI: 10.1093/database/baaa108] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/02/2020] [Revised: 11/12/2020] [Accepted: 11/20/2020] [Indexed: 11/18/2022]
Abstract
Bioinformatic tools are currently being developed to better understand the Mycobacterium tuberculosis complex (MTBC). Several approaches already exist for the identification of MTBC lineages using classical genotyping methods such as mycobacterial interspersed repetitive units-variable number of tandem DNA repeats and spoligotyping-based families. In the recently released SITVIT2 proprietary database of the Institut Pasteur de la Guadeloupe, a large number of spoligotype families were assigned by either manual curation/expertise or using an in-house algorithm. In this study, we present two complementary data-driven approaches allowing fast and precise family prediction from spoligotyping patterns. The first one is based on data transformation and the use of decision tree classifiers. In contrast, the second one searches for a set of simple rules using binary masks through a specifically designed evolutionary algorithm. The comparison with the three main approaches in the field highlighted the good performances of our contributions and the significant runtime gain. Finally, we propose the 'SpolLineages' software tool (https://github.com/dcouvin/SpolLineages), which implements these approaches for MTBC spoligotype families' identification.
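The binary-mask idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classical 43-spacer format, and it assumes a rule consists of a mask, an expected value, and a family label; the function names, the toy patterns, and the family labels "A" and "B" are hypothetical and are not taken from SpolLineages. An evolutionary search would need some score for each candidate rule, so the sketch counts true and false positives on labeled patterns.

```python
from typing import List, Tuple

N_SPACERS = 43  # classical spoligotyping format (assumption of this sketch)


def to_bits(pattern: str) -> int:
    """Convert a 43-character '1'/'0' string (spacer 1 first) to an integer."""
    if len(pattern) != N_SPACERS or set(pattern) - {"0", "1"}:
        raise ValueError("expected a 43-character binary string")
    return int(pattern, 2)


def rule_matches(pattern: int, mask: int, value: int) -> bool:
    """A rule fires when the spacers selected by the mask equal the expected value."""
    return (pattern & mask) == value


def rule_fitness(mask: int, value: int, family: str,
                 labeled: List[Tuple[int, str]]) -> Tuple[int, int]:
    """Return (true positives, false positives) of one rule on labeled patterns."""
    tp = fp = 0
    for pattern, label in labeled:
        if rule_matches(pattern, mask, value):
            if label == family:
                tp += 1
            else:
                fp += 1
    return tp, fp


# Hypothetical rule: "spacers 1 to 3 are all absent" predicts toy family "A".
mask = to_bits("111" + "0" * 40)
labeled = [
    (to_bits("000" + "1" * 40), "A"),
    (to_bits("1" * 43), "B"),
    (to_bits("000" + "0" * 5 + "1" * 35), "A"),
    (to_bits("000" + "1" * 39 + "0"), "B"),
]
print(rule_fitness(mask, 0, "A", labeled))
```

Encoding each pattern as an integer keeps rule evaluation to a single bitwise AND and comparison, which is one plausible reason mask-based rules can be evaluated quickly.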
Affiliation(s)
- David Couvin
- WHO Supranational TB Reference Laboratory, Tuberculosis and Mycobacteria Unit, Institut Pasteur de la Guadeloupe, F-97183, Abymes, Guadeloupe, France
- Wilfried Segretier
- Laboratoire de Mathématiques Informatique et Applications (LAMIA), Université des Antilles, F-97154, Pointe-à-Pitre, Guadeloupe, France
- Erick Stattner
- Laboratoire de Mathématiques Informatique et Applications (LAMIA), Université des Antilles, F-97154, Pointe-à-Pitre, Guadeloupe, France
- Nalin Rastogi
- WHO Supranational TB Reference Laboratory, Tuberculosis and Mycobacteria Unit, Institut Pasteur de la Guadeloupe, F-97183, Abymes, Guadeloupe, France
3
Busch A, Homeier-Bachmann T, Abdel-Glil MY, Hackbart A, Hotzel H, Tomaso H. Using affinity propagation clustering for identifying bacterial clades and subclades with whole-genome sequences of Francisella tularensis. PLoS Negl Trop Dis 2020; 14:e0008018. [PMID: 32991594 PMCID: PMC7523947 DOI: 10.1371/journal.pntd.0008018] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/08/2019] [Accepted: 12/27/2019] [Indexed: 12/31/2022] Open
Abstract
By combining a reference-independent SNP analysis and average nucleotide identity (ANI) with affinity propagation clustering (APC), we developed a significantly improved methodology for resolving phylogenetic relationships based on objective criteria. These bioinformatics tools can be used as a general ruler to determine phylogenetic relationships and clustering of bacteria, exemplified here with Francisella (F.) tularensis. Molecular epidemiology of F. tularensis is currently assessed mostly based on laboratory methods and molecular analysis. The high evolutionary stability and the clonal nature make Francisella ideal for subtyping with single nucleotide polymorphisms (SNPs). Sequencing and real-time PCR can be used to validate the SNP analysis. We investigated whole-genome sequences of 155 F. tularensis subsp. holarctica isolates. Phylogenetic testing was based on SNPs and ANI as reference-independent, alignment-free methods that take small-scale and large-scale differences within the genomes into account. In particular, the whole-genome SNP analysis with kSNP3.0 allowed deciphering quite subtle signals of systematic differences in molecular variation. APC resulted in three clusters showing the known clades B.4, B.6, and B.12. These data correlated with the results of real-time PCR assays targeting canSNP loci. Additionally, we detected two subtle sub-clusters. SplitsTree was used with standard settings on the aligned SNPs from Parsnp. Together, APC, HierBAPS, and SplitsTree enabled us to generate hypotheses about epidemiologic relationships between bacterial clusters and to describe the distribution of isolates. Our data indicate that the choice of the typing technique can increase our understanding of the pathogenesis and transmission of diseases, with eventual potential for prevention. This opens perspectives for application to other bacterial species.
The data provide evidence that Germany might be the collision zone where the clade B.12, also known as the East European clade, overlaps with the clade B.6, also known as the Iberian clade. The described methods allow generating a new, more detailed perspective on F. tularensis subsp. holarctica phylogeny. These results may encourage determining phylogenetic relationships and clustering of other bacteria in the same way.
Affiliation(s)
- Anne Busch
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
- Timo Homeier-Bachmann
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Epidemiology, Friedrich-Loeffler-Institut, Greifswald-Insel Riems, Germany
- Mostafa Y. Abdel-Glil
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
- Anja Hackbart
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
- Helmut Hotzel
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
- Herbert Tomaso
- Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institut, Jena, Germany
4
Abstract
Bacteria occur ubiquitously in nature and are broadly relevant throughout the food supply chain, with diverse and variable tolerance levels depending on their origin, biological role, and impact on the quality and safety of the product as well as on the health of the consumer. With increasing knowledge of and accessibility to the microbial composition of our environments, food supply, and host-associated microbiota, our understanding of and appreciation for the ratio of beneficial to undesirable bacteria are rapidly evolving. Therefore, there is a need for tools and technologies that allow definite, accurate, and high-resolution identification and typing of various groups of bacteria that include beneficial microbes such as starter cultures and probiotics, innocuous commensals, and undesirable pathogens and spoilage organisms. During the transition from the current molecular biology-based PFGE (pulsed-field gel electrophoresis) gold standard to the increasingly accessible omics-level whole-genome sequencing (WGS) N-gen standard, high-resolution technologies such as CRISPR-based genotyping constitute practical and powerful alternatives that provide valuable insights into genome microevolution and evolutionary trajectories. Indeed, several studies have shown potential for CRISPR-based typing of industrial starter cultures, health-promoting probiotic strains, animal commensal species, and problematic pathogens. Emerging CRISPR-based typing methods open new avenues for high-resolution typing of a broad range of bacteria and constitute a practical means for rapid tracking of a diversity of food-associated microbes.
Affiliation(s)
- Rodolphe Barrangou
- Department of Food, Bioprocessing and Nutrition Sciences, North Carolina State University, Raleigh, North Carolina 27695
- Department of Food Science, The Pennsylvania State University, University Park, Pennsylvania 16802
- Edward G Dudley
- Department of Food Science, The Pennsylvania State University, University Park, Pennsylvania 16802
5
Azé J, Sola C, Zhang J, Lafosse-Marin F, Yasmin M, Siddiqui R, Kremer K, van Soolingen D, Refrégier G. Genomics and Machine Learning for Taxonomy Consensus: The Mycobacterium tuberculosis Complex Paradigm. PLoS One 2015; 10:e0130912. [PMID: 26154264 PMCID: PMC4496040 DOI: 10.1371/journal.pone.0130912] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 02/17/2015] [Accepted: 05/25/2015] [Indexed: 11/18/2022] Open
Abstract
Infra-species taxonomy is a prerequisite for comparing features such as virulence across pathogen lineages. Mycobacterium tuberculosis complex taxonomy has rapidly evolved in the last 20 years through intensive clinical isolation, advances in sequencing and in the description of fast-evolving loci (CRISPR and MIRU-VNTR). On-line tools to describe new isolates have been set up based on known diversity either on CRISPRs (also known as spoligotypes) or on MIRU-VNTR profiles. The underlying taxonomies are largely concordant but use different names and offer different depths. The objectives of this study were 1) to make explicit the consensus that exists between the alternative taxonomies, and 2) to provide an on-line tool to ease classification of new isolates. Genotyping (24-VNTR, 43-spacer spoligotypes, IS6110-RFLP) was undertaken for 3,454 clinical isolates from the Netherlands (2004-2008). The resulting database was enlarged with African isolates to include most human tuberculosis diversity. Assignments were obtained using TB-Lineage, MIRU-VNTRPlus, SITVITWEB and an algorithm from Borile et al. By identifying the recurrent concordances between the alternative taxonomies, we proposed a consensus including 22 sublineages. Original and consensus assignments of all the isolates from the database were subsequently fed into an ensemble learning approach based on the Machine Learning tool Weka to derive a classification scheme. All assignments were reproduced with very good sensitivities and specificities. When applied to independent datasets, it was able to suggest new sublineages such as pseudo-Beijing. This Lineage Prediction tool, efficient on 15-MIRU, 24-VNTR and spoligotype data, is available on the web interface “TBminer.” Another section of this website helps summarize key molecular epidemiological data, easing tuberculosis surveillance. Altogether, we successfully used Machine Learning on a large dataset to set up and make available the first consensual taxonomy for human Mycobacterium tuberculosis complex. Additional developments using SNPs will help stabilize it.
Affiliation(s)
- Jérôme Azé
- LIRMM UM CNRS, UMR 5506, 860 rue de St Priest, 34095 Montpellier cedex 5, France
- Christophe Sola
- Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Sud, rue Gregor Mendel, Bât 400, 91405 Orsay cedex, France
- Jian Zhang
- Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Sud, rue Gregor Mendel, Bât 400, 91405 Orsay cedex, France
- Florian Lafosse-Marin
- Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Sud, rue Gregor Mendel, Bât 400, 91405 Orsay cedex, France
- Memona Yasmin
- Pakistan Institute for Engineering and Applied Sciences (PIEAS), Lehtrar Road, Nilore, Islamabad, Pakistan
- Health Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), P.O. Box # 577, Jhang Road, Faisalabad, Pakistan
- Rubina Siddiqui
- Health Biotechnology Division, National Institute for Biotechnology and Genetic Engineering (NIBGE), P.O. Box # 577, Jhang Road, Faisalabad, Pakistan
- Kristin Kremer
- National Institute for Public Health and the Environment, P.O. Box 1, 3720 BA Bilthoven, The Netherlands
- Dick van Soolingen
- National Institute for Public Health and the Environment, P.O. Box 1, 3720 BA Bilthoven, The Netherlands
- Department of Pulmonary Diseases and Department of Microbiology, Radboud University Nijmegen Medical Centre, University Lung Centre Dekkerswald, P.O. Box 9101, 6500 HB Nijmegen, The Netherlands
- Guislaine Refrégier
- Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris-Sud, rue Gregor Mendel, Bât 400, 91405 Orsay cedex, France
6
Sola C. Clustured regularly interspersed short palindromic repeats (CRISPR) genetic diversity studies as a mean to reconstruct the evolution of the Mycobacterium tuberculosis complex. Tuberculosis (Edinb) 2015; 95 Suppl 1:S159-66. [PMID: 25748060 DOI: 10.1016/j.tube.2015.02.029] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 12/01/2022]
Abstract
The natural history of tuberculosis may be tackled by various means, among them the record of molecular scars registered by the Mycobacterium tuberculosis complex (MTBC) genomes transmitted from patient to patient for tens of thousands of years and possibly more. Recently discovered polymorphic loci, the CRISPR sequences, are indirect witnesses of the historical phage-bacteria struggle, and may be related to the time when the ancestors of today's tubercle bacilli were environmental bacteria, i.e. before becoming intracellular parasites. In this article, we present what CRISPRs are and try to summarize almost 20 years of research results obtained using the genetic diversity of the CRISPR loci in MTBC as a perspective for studying new models. We show that the study of the diversity of CRISPR sequences, thanks to «spoligotyping», has played a great role in our global understanding of the population structure of MTBC.
Affiliation(s)
- Christophe Sola
- Institut de Biologie Intégrative de la Cellule (I2BC), CEA, CNRS, Université Paris-Saclay, Orsay, France.
7
Vasconcellos SEG, Acosta CC, Gomes LL, Conceição EC, Lima KV, de Araujo MI, Leite MDL, Tannure F, Caldas PCDS, Gomes HM, Santos AR, Gomgnimbou MK, Sola C, Couvin D, Rastogi N, Boechat N, Suffys PN. Strain classification of Mycobacterium tuberculosis isolates in Brazil based on genotypes obtained by spoligotyping, mycobacterial interspersed repetitive unit typing and the presence of large sequence and single nucleotide polymorphism. PLoS One 2014; 9:e107747. [PMID: 25314118 PMCID: PMC4196770 DOI: 10.1371/journal.pone.0107747] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 01/01/2014] [Accepted: 08/21/2014] [Indexed: 11/26/2022] Open
Abstract
Rio de Janeiro is endemic for tuberculosis (TB) and presents the second largest prevalence of the disease in Brazil. Here, we present the bacterial population structure of 218 isolates of Mycobacterium tuberculosis, derived from 186 patients that were diagnosed between January 2008 and December 2009. Genotypes were generated by means of spoligotyping, 24 MIRU-VNTR typing and presence of fbpC103, RDRio and RD174. The results confirmed earlier data that predominant genotypes in Rio de Janeiro are those of the Euro American Lineages (99%). However, we observed differences between the classification by spoligotyping when comparing to that of 24 MIRU-VNTR typing, being respectively 43.6% vs. 62.4% of LAM, 34.9% vs. 9.6% of T and 18.3% vs. 21.5% of Haarlem. Among isolates classified as LAM by MIRU typing, 28.0% did not present the characteristic spoligotype profile with absence of spacers 21 to 24 and 32 to 36 and we designated these conveniently as “LAM-like”, 79.3% of these presenting the LAM-specific SNP fbpC103. The frequency of RDRio and RD174 in the LAM strains, as defined both by spoligotyping and 24 MIRU-VNTR loci, were respectively 11% and 15.4%, demonstrating that RD174 is not always a marker for LAM/RDRio strains. We conclude that, although spoligotyping alone is a tool for classification of strains of the Euro-American lineage, when combined with MIRU-VNTRs, SNPs and RD typing, it leads to a much better understanding of the bacterial population structure and phylogenetic relationships among strains of M. tuberculosis in regions with high incidence of TB.
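The spoligotype criterion described above (absence of spacers 21 to 24 and 32 to 36) can be checked mechanically. The following is a minimal sketch, assuming a spoligotype is given as a 43-character string of '1' (spacer present) and '0' (spacer absent) with spacer 1 first; the function and constant names are hypothetical, and the spacer ranges are copied from the abstract as written.

```python
from typing import Set

# Spacers that must be absent, taken verbatim from the abstract: 21-24 and 32-36.
LAM_ABSENT: Set[int] = set(range(21, 25)) | set(range(32, 37))


def absent_spacers(spoligotype: str) -> Set[int]:
    """Return the 1-based positions of spacers marked absent ('0')."""
    return {i + 1 for i, c in enumerate(spoligotype) if c == "0"}


def has_lam_profile(spoligotype: str) -> bool:
    """True when every spacer required to be absent is absent."""
    if len(spoligotype) != 43 or set(spoligotype) - {"0", "1"}:
        raise ValueError("expected a 43-character binary string")
    return LAM_ABSENT <= absent_spacers(spoligotype)


def make_pattern(absent: Set[int]) -> str:
    """Helper for the toy example: build a pattern with the given spacers absent."""
    return "".join("0" if i in absent else "1" for i in range(1, 44))


print(has_lam_profile(make_pattern(LAM_ABSENT)))  # pattern lacking exactly those spacers
print(has_lam_profile("1" * 43))                  # all spacers present
```

A check like this only captures the spoligotype signature. In the study, isolates assigned to LAM by MIRU-VNTR typing that fail such a check are the ones the authors call "LAM-like".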
Affiliation(s)
- Sidra E. G. Vasconcellos
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Multidisciplinary Research Laboratory, University Hospital Clementino Fraga Filho – HUCFF, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Chyntia Carolina Acosta
- Laboratory of Cellular Microbiology, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Lia Lima Gomes
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Karla Valéria Lima
- Instituto Evandro Chagas, Section of Bacteriology and Mycology, Belém, Pará, Brazil
- Marcelo Ivens de Araujo
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Maria de Lourdes Leite
- Hospital Municipal Rafael de Paula Souza, Municipal Secretary of Health, Rio de Janeiro, Rio de Janeiro, Brazil
- Flávio Tannure
- Hospital Municipal Rafael de Paula Souza, Municipal Secretary of Health, Rio de Janeiro, Rio de Janeiro, Brazil
- Paulo Cesar de Souza Caldas
- Centro de Referência Professor Hélio Fraga, Escola Nacional de Saúde Publica Sergio Arouca, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Harrison M. Gomes
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Adalberto Rezende Santos
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
- Michel K. Gomgnimbou
- CNRS–Université Paris–Sud, Institut de Génétique et Microbiologie–Infection Genetics Emerging Pathogens Evolution Team, Orsay, France
- Christophe Sola
- CNRS–Université Paris–Sud, Institut de Génétique et Microbiologie–Infection Genetics Emerging Pathogens Evolution Team, Orsay, France
- David Couvin
- Supranational TB Reference Laboratory, Unité de la Tuberculose et des Mycobactéries, Institut Pasteur de Guadeloupe, Abymes, Guadeloupe, France
- Nalin Rastogi
- Supranational TB Reference Laboratory, Unité de la Tuberculose et des Mycobactéries, Institut Pasteur de Guadeloupe, Abymes, Guadeloupe, France
- Neio Boechat
- Multidisciplinary Research Laboratory, University Hospital Clementino Fraga Filho – HUCFF, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Graduate Program in Clinical Medicine, Faculty of Medicine, University Hospital Clementino Fraga Filho, Rio de Janeiro, Rio de Janeiro, Brazil
- Philip Noel Suffys
- Laboratory of Molecular Biology Applied to Mycobacteria, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Rio de Janeiro, Brazil
8
Wang M, Zhang W, Ding W, Dai D, Zhang H, Xie H, Chen L, Guo Y, Xie J. Parallel clustering algorithm for large-scale biological data sets. PLoS One 2014; 9:e91315. [PMID: 24705246 PMCID: PMC3976248 DOI: 10.1371/journal.pone.0091315] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 11/13/2013] [Accepted: 02/10/2014] [Indexed: 02/06/2023] Open
Abstract
Background: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between data pairs, a similarity matrix, which itself takes a long time to construct, is required before affinity propagation can run.
Methods: Two types of parallel architectures are proposed in this paper to accelerate the construction of the similarity matrix and the affinity propagation algorithm. The shared-memory architecture is used to construct the similarity matrix, and the distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. An appropriate scheme for data partitioning and reduction is designed to minimize the global communication cost among processes.
Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
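As a quick reading aid for the reported figures, parallel efficiency is conventionally defined as the speedup divided by the number of cores. The symbols S (speedup), p (cores) and E (efficiency) are introduced here for illustration and do not appear in the abstract.

```latex
E = \frac{S}{p} = \frac{100}{128} \approx 0.78
```

Under this standard definition, the reported speedup corresponds to roughly 78% of ideal linear scaling on 128 cores.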
Affiliation(s)
- Minchao Wang
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Wu Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- High Performance Computing Center, Shanghai University, Shanghai, P.R.China
- Wang Ding
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Dongbo Dai
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Huiran Zhang
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Hao Xie
- College of Stomatology, Wuhan University, Wuhan, P.R.China
- Luonan Chen
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R.China
- Yike Guo
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
- Department of Computing, Imperial College London, London, United Kingdom
- Jiang Xie
- School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China
9
Ozcaglar C, Shabbeer A, Kurepina N, Rastogi N, Yener B, Bennett KP. Inferred spoligoforest topology unravels spatially bimodal distribution of mutations in the DR region. IEEE Trans Nanobioscience 2012; 11:191-202. [PMID: 22987125 DOI: 10.1109/tnb.2012.2213265] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/07/2024]
Abstract
Biomarkers of Mycobacterium tuberculosis complex (MTBC) mutate over time. Among the biomarkers of MTBC, spacer oligonucleotide type (spoligotype) and mycobacterium interspersed repetitive unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements using MIRU patterns to disambiguate the ancestors of spoligotypes. We use a large patient dataset from the United States Centers for Disease Control and Prevention (CDC) to generate this model. Based on the contiguous deletion assumption and rare observation of convergent evolution, we first generate the most parsimonious forest of spoligotypes, called a spoligoforest, using three genetic distance measures. An analysis of topological attributes of the spoligoforest and number of variations at the direct repeat (DR) locus of each strain reveals interesting properties of deletions in the DR region. First, we compare our mutation model to existing mutation models of spoligotypes and find that our mutation model produces as many within-lineage mutation events as other models, with slightly higher segregation accuracy. Second, based on our mutation model, the number of descendant spoligotypes follows a power law distribution. Third, contrary to prior studies, the power law distribution does not plausibly fit to the mutation length frequency. Moreover, we find that the total number of mutation events at consecutive spacers follows a spatially bimodal distribution. The two modes are spacers 13 and 40, which are hotspots for chromosomal rearrangements, and the change point is spacer 34, which is absent in most MTBC strains. Based on this observation, we built two alternative models for mutation length frequency: the Starting Point Model (SPM) and the Longest Block Model (LBM). Both models are plausibly good fits to the mutation length frequency distribution, as verified by the goodness-of-fit test. 
We also apply SPM and LBM to a dataset from Institut Pasteur de Guadeloupe and verify that these models hold for different strain datasets.
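The contiguous deletion assumption mentioned in the abstract can be made concrete with a small sketch. It assumes spoligotypes are binary strings ('1' present, '0' absent) and that a child may derive from a parent only by losing one contiguous block of spacers, never by gaining one. The function name and the toy patterns are hypothetical, and this is not the authors' spoligoforest construction, which also uses MIRU patterns and genetic distance measures to choose among candidate parents.

```python
from typing import Optional, Tuple


def contiguous_deletion(parent: str, child: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (first, last) spacers of the single contiguous block
    lost between parent and child, or None if no such single deletion exists."""
    if len(parent) != len(child):
        raise ValueError("patterns must have the same length")
    # Spacers cannot be regained under the assumption used here.
    if any(p == "0" and c == "1" for p, c in zip(parent, child)):
        return None
    lost = [i for i, (p, c) in enumerate(zip(parent, child)) if p == "1" and c == "0"]
    if not lost:
        return None  # identical patterns: no mutation event
    first, last = lost[0], lost[-1]
    # Every spacer inside the deleted block must be absent in the child.
    if "1" in child[first:last + 1]:
        return None
    return first + 1, last + 1


# Short toy patterns keep the example readable; real patterns have 43 spacers.
print(contiguous_deletion("1111111111", "1110001111"))
print(contiguous_deletion("1111111", "1010101"))
```

Spacers already absent in the parent may fall inside the deleted block, which is why the check looks at the child alone between the first and last lost spacer.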
Affiliation(s)
- Cagri Ozcaglar
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
10
Shabbeer A, Cowan LS, Ozcaglar C, Rastogi N, Vandenberg SL, Yener B, Bennett KP. TB-Lineage: an online tool for classification and analysis of strains of Mycobacterium tuberculosis complex. Infect Genet Evol 2012; 12:789-97. [PMID: 22406225 DOI: 10.1016/j.meegid.2012.02.010] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.8] [Received: 08/23/2011] [Revised: 02/18/2012] [Accepted: 02/21/2012] [Indexed: 11/19/2022]
Abstract
This paper formulates a set of rules to classify genotypes of the Mycobacterium tuberculosis complex (MTBC) into major lineages using spoligotypes and MIRU-VNTR results. The rules synthesize prior literature that characterizes lineages by spacer deletions and variations in the number of repeats seen at locus MIRU24 (alias VNTR2687). A tool that efficiently and accurately implements this rule base is now freely available at http://tbinsight.cs.rpi.edu/run_tb_lineage.html. When MIRU24 data is not available, the system utilizes predictions made by a Naïve Bayes classifier based on spoligotype data. This website also provides a tool to generate spoligoforests in order to visualize the genetic diversity and relatedness of genotypes and their associated lineages. A detailed analysis of the application of these tools on a dataset collected by the CDC consisting of 3198 distinct spoligotypes and 5430 distinct MIRU-VNTR types from 37,066 clinical isolates is presented. The tools were also tested on four other independent datasets. The accuracy of automated classification using both spoligotypes and MIRU24 is >99%, and using spoligotypes alone is >95%. This online rule-based classification technique in conjunction with genotype visualization provides a practical tool that supports surveillance of TB transmission trends and molecular epidemiological studies.
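The abstract mentions a Naïve Bayes classifier on spoligotype data as a fallback when MIRU24 is unavailable. The sketch below shows one generic way such a classifier could work: a Bernoulli Naïve Bayes over spacer presence with Laplace smoothing. The feature encoding, the smoothing, the function names, and the toy lineages "X" and "Y" are all assumptions for illustration; the actual TB-Lineage model and training data are not described in this text.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Model = Tuple[Dict[str, float], Dict[str, List[float]]]


def train_bernoulli_nb(samples: List[Tuple[str, str]]) -> Model:
    """samples: (binary spacer string, lineage label) pairs.
    Returns (log priors, per-lineage probability that each spacer is present)."""
    counts: Dict[str, int] = defaultdict(int)
    present: Dict[str, List[int]] = {}
    for pattern, label in samples:
        counts[label] += 1
        tally = present.setdefault(label, [0] * len(pattern))
        for i, c in enumerate(pattern):
            tally[i] += c == "1"
    total = len(samples)
    priors = {lab: math.log(n / total) for lab, n in counts.items()}
    # Laplace smoothing keeps every probability strictly between 0 and 1.
    probs = {lab: [(k + 1) / (counts[lab] + 2) for k in present[lab]]
             for lab in counts}
    return priors, probs


def predict(pattern: str, model: Model) -> str:
    """Return the lineage with the highest posterior log score."""
    priors, probs = model

    def score(lab: str) -> float:
        s = priors[lab]
        for c, p in zip(pattern, probs[lab]):
            s += math.log(p if c == "1" else 1.0 - p)
        return s

    return max(priors, key=score)


# Toy 4-spacer patterns with hypothetical lineage labels.
model = train_bernoulli_nb([("1100", "X"), ("1101", "X"),
                            ("0011", "Y"), ("0111", "Y")])
print(predict("1100", model), predict("0011", model))
```

Working in log space avoids numerical underflow when 43 per-spacer probabilities are multiplied together.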
Affiliation(s)
- Amina Shabbeer
- Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY, USA.
11
Barrangou R, Horvath P. CRISPR: new horizons in phage resistance and strain identification. Annu Rev Food Sci Technol 2011; 3:143-62. [PMID: 22224556 DOI: 10.1146/annurev-food-022811-101134] [Citation(s) in RCA: 124] [Impact Index Per Article: 9.5] [Indexed: 11/09/2022]
Abstract
Bacteria have been widely used as starter cultures in the food industry, notably for the fermentation of milk into dairy products such as cheese and yogurt. Lactic acid bacteria used in food manufacturing, such as lactobacilli, lactococci, streptococci, Leuconostoc, pediococci, and bifidobacteria, are selectively formulated based on functional characteristics that provide idiosyncratic flavor and texture attributes, as well as their ability to withstand processing and manufacturing conditions. Unfortunately, given frequent viral exposure in industrial environments, starter culture selection and development rely on defense systems that provide resistance against bacteriophage predation, including restriction-modification, abortive infection, and recently discovered CRISPRs (clustered regularly interspaced short palindromic repeats). CRISPRs, together with CRISPR-associated genes (cas), form the CRISPR/Cas immune system, which provides adaptive immunity against phages and invasive genetic elements. The immunization process is based on the incorporation of short DNA sequences from virulent phages into the CRISPR locus. Subsequently, CRISPR transcripts are processed into small interfering RNAs that guide a multifunctional protein complex to recognize and cleave matching foreign DNA. Hypervariable CRISPR loci provide insights into the phage and host population dynamics, and new avenues for enhanced phage resistance and genetic typing and tagging of industrial strains.
|
12
|
Shabbeer A, Ozcaglar C, Yener B, Bennett KP. Web tools for molecular epidemiology of tuberculosis. Infection, Genetics and Evolution 2011; 12:767-81. [PMID: 21903179 DOI: 10.1016/j.meegid.2011.08.019] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 06/17/2011] [Revised: 08/14/2011] [Accepted: 08/19/2011] [Indexed: 01/03/2023]
Abstract
In this study we explore publicly available web tools designed to use molecular epidemiological data to extract information that can be employed for the effective tracking and control of tuberculosis (TB). The application of molecular methods to the epidemiology of TB complements traditional approaches used in public health. DNA fingerprinting methods are now routinely employed in TB surveillance programs and are primarily used to detect recent transmissions and in outbreak investigations. Here we present web tools that facilitate systematic analysis of Mycobacterium tuberculosis complex (MTBC) genotype information and provide a view of the genetic diversity in the MTBC population. These tools help answer questions about the characteristics of MTBC strains, such as their pathogenicity, virulence, immunogenicity, transmissibility, drug-resistance profiles and host-pathogen associativity. They provide an integrated platform for researchers to use molecular epidemiological data to address current challenges in the understanding of TB dynamics and the characteristics of MTBC.
Affiliation(s)
- Amina Shabbeer
- Department of Mathematical Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
|
13
|
Ozcaglar C, Shabbeer A, Kurepina N, Yener B, Bennett KP. Data-driven insights into deletions of Mycobacterium tuberculosis complex chromosomal DR region using spoligoforests. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2011:75-82. [PMID: 22343484 PMCID: PMC3279189 DOI: 10.1109/bibm.2011.64] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Biomarkers of Mycobacterium tuberculosis complex (MTBC) mutate over time. Among the biomarkers of MTBC, spacer oligonucleotide type (spoligotype) and Mycobacterium Interspersed Repetitive Unit (MIRU) patterns are commonly used to genotype clinical MTBC strains. In this study, we present an evolution model of spoligotype rearrangements using MIRU patterns to disambiguate the ancestors of spoligotypes, in a large patient dataset from the United States Centers for Disease Control and Prevention (CDC). Based on the contiguous deletion assumption and rare observation of convergent evolution, we first generate the most parsimonious forest of spoligotypes, called a spoligoforest, using three genetic distance measures. An analysis of topological attributes of the spoligoforest and number of variations at the direct repeat (DR) locus of each strain reveals interesting properties of deletions in the DR region. First, we compare our mutation model to existing mutation models of spoligotypes and find that our mutation model produces as many within-lineage mutation events as other models, with slightly higher segregation accuracy. Second, based on our mutation model, the number of descendant spoligotypes follows a power law distribution. Third, contrary to prior studies, the power law distribution does not plausibly fit the mutation length frequency. Finally, the total number of mutation events at consecutive DR loci follows a bimodal distribution, which results in accumulation of shorter deletions in the DR region. The two modes are spacers 13 and 40, which are hotspots for chromosomal rearrangements. The change point in the bimodal distribution is spacer 34, which is absent in most MTBC strains. This bimodal separation results in accumulation of shorter deletions, which explains why a power law distribution is not a plausible fit to the mutation length frequency.
Affiliation(s)
- Cagri Ozcaglar
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
- Amina Shabbeer
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
- Bülent Yener
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
- Kristin P. Bennett
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY
- Mathematical Sciences Department, Rensselaer Polytechnic Institute, Troy, NY
|