1
|
Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol 2015; 16:259. [PMID: 26619908 PMCID: PMC4665391 DOI: 10.1186/s13059-015-0831-x] [Citation(s) in RCA: 1550] [Impact Index Per Article: 155.0] [Received: 08/10/2015] [Accepted: 11/11/2015] [Indexed: 12/22/2022] Open
Abstract
HiC-Pro is an optimized and flexible pipeline for processing Hi-C data from raw reads to normalized contact maps. HiC-Pro maps reads, detects valid ligation products, performs quality controls and generates intra- and inter-chromosomal contact maps. It includes a fast implementation of the iterative correction method and is based on a memory-efficient data format for Hi-C contact maps. In addition, HiC-Pro can use phased genotype data to build allele-specific contact maps. We applied HiC-Pro to different Hi-C datasets, demonstrating its ability to easily process large data in a reasonable time. Source code and documentation are available at http://github.com/nservant/HiC-Pro.
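The iterative correction step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a symmetric contact map stored as upper-triangular `{(bin_i, bin_j): count}` entries, which is only one plausible reading of a "memory-efficient data format". The names `row_sums` and `iterative_correction` are hypothetical, and this is not HiC-Pro's own code.

```python
def row_sums(contacts, n_bins):
    """Row sums of a symmetric map stored as {(i, j): count} with i <= j."""
    sums = [0.0] * n_bins
    for (i, j), value in contacts.items():
        sums[i] += value
        if i != j:  # mirror the off-diagonal entry into row j
            sums[j] += value
    return sums


def iterative_correction(contacts, n_bins, max_iter=100, tol=1e-10):
    """Balance the map so that all covered bins end up with equal row sums.

    Returns the balanced sparse map and the accumulated per-bin bias.
    """
    normalized = dict(contacts)
    bias = [1.0] * n_bins
    for _ in range(max_iter):
        sums = row_sums(normalized, n_bins)
        covered = [s for s in sums if s > 0]
        mean = sum(covered) / len(covered)
        # Bins with no contacts keep a neutral factor of 1.
        factors = [s / mean if s > 0 else 1.0 for s in sums]
        normalized = {
            (i, j): value / (factors[i] * factors[j])
            for (i, j), value in normalized.items()
        }
        bias = [b * f for b, f in zip(bias, factors)]
        if max(abs(f - 1.0) for f in factors) < tol:
            break
    return normalized, bias


# Toy map whose counts are pure bias products b_i * b_j with b = (1, 2, 4).
toy = {(0, 0): 1, (0, 1): 2, (0, 2): 4, (1, 1): 4, (1, 2): 8, (2, 2): 16}
balanced, bias = iterative_correction(toy, n_bins=3)
```

On this toy input the estimated biases are proportional to (1, 2, 4) and every balanced row has the same sum. Storing only the nonzero upper-triangular entries keeps memory proportional to the number of observed contacts rather than to the square of the number of bins.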
|
Research Support, Non-U.S. Gov't |
10 |
1550 |
2
|
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics 2014; 30:i26-33. [PMID: 24931992 PMCID: PMC4229903 DOI: 10.1093/bioinformatics/btu268] [Citation(s) in RCA: 164] [Impact Index Per Article: 14.9] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA-DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction.
METHODS: We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
RESULTS: We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms (two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method) on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions.
AVAILABILITY AND IMPLEMENTATION: A Python implementation of the proposed method is available at http://cbio.ensmp.fr/pastis.
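The Poisson model described in this abstract can be sketched as an objective function over candidate 3D coordinates. The power-law transfer function `beta * d ** alpha` (with `alpha < 0`) is an assumption made here for illustration, since the abstract only says the intensity decreases with distance. The function name and parameters are hypothetical and this is not the published implementation.

```python
import math


def poisson_neg_log_likelihood(coords, counts, alpha=-3.0, beta=1.0):
    """Negative log-likelihood of contact counts under a Poisson model.

    coords: list of (x, y, z) positions, one per locus.
    counts: {(i, j): observed contact count} for pairs with i < j.
    The intensity beta * d ** alpha is an assumed transfer function.
    """
    nll = 0.0
    for (i, j), c in counts.items():
        d = math.dist(coords[i], coords[j])  # Euclidean distance (Python 3.8+)
        lam = beta * d ** alpha
        # -log Poisson(c; lam) = lam - c * log(lam) + log(c!)
        nll += lam - c * math.log(lam) + math.lgamma(c + 1)
    return nll


# Three loci on a line; the counts equal the intensities implied by beta = 8.
counts = {(0, 1): 8, (1, 2): 8, (0, 2): 1}
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.866, 0.0)]
nll_line = poisson_neg_log_likelihood(line, counts, beta=8.0)
nll_triangle = poisson_neg_log_likelihood(triangle, counts, beta=8.0)
```

Here the linear arrangement explains the counts better than the triangle, so its negative log-likelihood is lower. An inference method would minimize this quantity over `coords`, and optionally over `alpha` and `beta`, with a numerical optimizer.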
|
Research Support, Non-U.S. Gov't |
11 |
164 |
3
|
Ay F, Bunnik EM, Varoquaux N, Bol SM, Prudhomme J, Vert JP, Noble WS, Le Roch KG. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res 2014; 24:974-88. [PMID: 24671853 PMCID: PMC4032861 DOI: 10.1101/gr.169417.113] [Citation(s) in RCA: 158] [Impact Index Per Article: 14.4] [Indexed: 01/15/2023]
Abstract
The development of the human malaria parasite Plasmodium falciparum is controlled by coordinated changes in gene expression throughout its complex life cycle, but the corresponding regulatory mechanisms are incompletely understood. To study the relationship between genome architecture and gene regulation in Plasmodium, we assayed the genome architecture of P. falciparum at three time points during its erythrocytic (asexual) cycle. Using chromosome conformation capture coupled with next-generation sequencing technology (Hi-C), we obtained high-resolution chromosomal contact maps, which we then used to construct a consensus three-dimensional genome structure for each time point. We observed strong clustering of centromeres, telomeres, ribosomal DNA, and virulence genes, resulting in a complex architecture that cannot be explained by a simple volume exclusion model. Internal virulence gene clusters exhibit domain-like structures in contact maps, suggesting that they play an important role in the genome architecture. Midway during the erythrocytic cycle, at the highly transcriptionally active trophozoite stage, the genome adopts a more open chromatin structure with increased chromosomal intermingling. In addition, we observed reduced expression of genes located in spatial proximity to the repressive subtelomeric center, and colocalization of distinct groups of parasite-specific genes with coordinated expression profiles. Overall, our results are indicative of a strong association between the P. falciparum spatial genome organization and gene expression. Understanding the molecular processes involved in genome conformation dynamics could contribute to the discovery of novel antimalarial strategies.
|
Research Support, U.S. Gov't, Non-P.H.S. |
11 |
158 |
4
|
Xu L, Dong Z, Chiniquy D, Pierroz G, Deng S, Gao C, Diamond S, Simmons T, Wipf HML, Caddell D, Varoquaux N, Madera MA, Hutmacher R, Deutschbauer A, Dahlberg JA, Guerinot ML, Purdom E, Banfield JF, Taylor JW, Lemaux PG, Coleman-Derr D. Genome-resolved metagenomics reveals role of iron metabolism in drought-induced rhizosphere microbiome dynamics. Nat Commun 2021; 12:3209. [PMID: 34050180 PMCID: PMC8163885 DOI: 10.1038/s41467-021-23553-7] [Citation(s) in RCA: 93] [Impact Index Per Article: 23.3] [Received: 08/28/2020] [Accepted: 04/27/2021] [Indexed: 02/04/2023] Open
Abstract
Recent studies have demonstrated that drought leads to dramatic, highly conserved shifts in the root microbiome. At present, the molecular mechanisms underlying these responses remain largely uncharacterized. Here we employ genome-resolved metagenomics and comparative genomics to demonstrate that carbohydrate and secondary metabolite transport functionalities are overrepresented within drought-enriched taxa. These data also reveal that bacterial iron transport and metabolism functionality is highly correlated with drought enrichment. Using time-series root RNA-Seq data, we demonstrate that iron homeostasis within the root is impacted by drought stress, and that loss of a plant phytosiderophore iron transporter impacts microbial community composition, leading to significant increases in the drought-enriched lineage, Actinobacteria. Finally, we show that exogenous application of iron disrupts the drought-induced enrichment of Actinobacteria, as well as their improvement in host phenotype during drought stress. Collectively, our findings implicate iron metabolism in the root microbiome's response to drought and may inform efforts to improve plant drought tolerance to increase food security.
|
research-article |
4 |
93 |
5
|
Bunnik EM, Cook KB, Varoquaux N, Batugedara G, Prudhomme J, Cort A, Shi L, Andolina C, Ross LS, Brady D, Fidock DA, Nosten F, Tewari R, Sinnis P, Ay F, Vert JP, Noble WS, Le Roch KG. Changes in genome organization of parasite-specific gene families during the Plasmodium transmission stages. Nat Commun 2018; 9:1910. [PMID: 29765020 PMCID: PMC5954139 DOI: 10.1038/s41467-018-04295-5] [Citation(s) in RCA: 70] [Impact Index Per Article: 10.0] [Received: 06/14/2017] [Accepted: 04/18/2018] [Indexed: 12/20/2022] Open
Abstract
The development of malaria parasites throughout their various life cycle stages is coordinated by changes in gene expression. We previously showed that the three-dimensional organization of the Plasmodium falciparum genome is strongly associated with gene expression during its replication cycle inside red blood cells. Here, we analyze genome organization in the P. falciparum and P. vivax transmission stages. Major changes occur in the localization and interactions of genes involved in pathogenesis and immune evasion, host cell invasion, sexual differentiation, and master regulation of gene expression. Furthermore, we observe reorganization of subtelomeric heterochromatin around genes involved in host cell remodeling. Depletion of heterochromatin protein 1 (PfHP1) resulted in loss of interactions between virulence genes, confirming that PfHP1 is essential for maintenance of the repressive center. Our results suggest that the three-dimensional genome structure of human malaria parasites is strongly connected with transcriptional activity of specific gene families throughout the life cycle.
|
Research Support, N.I.H., Extramural |
7 |
70 |
6
|
Varoquaux N, Liachko I, Ay F, Burton JN, Shendure J, Dunham MJ, Vert JP, Noble WS. Accurate identification of centromere locations in yeast genomes using Hi-C. Nucleic Acids Res 2015; 43:5331-9. [PMID: 25940625 PMCID: PMC4477656 DOI: 10.1093/nar/gkv424] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 12/19/2014] [Accepted: 04/17/2015] [Indexed: 11/16/2022] Open
Abstract
Centromeres are essential for proper chromosome segregation. Despite extensive research, centromere locations in yeast genomes remain difficult to infer, and in most species they are still unknown. Recently, the chromatin conformation capture assay, Hi-C, has been re-purposed for diverse applications, including de novo genome assembly, deconvolution of metagenomic samples and inference of centromere locations. We describe a method, Centurion, that jointly infers the locations of all centromeres in a single genome from Hi-C data by exploiting the centromeres’ tendency to cluster in three-dimensional space. We first demonstrate the accuracy of Centurion in identifying known centromere locations from high coverage Hi-C data of budding yeast and a human malaria parasite. We then use Centurion to infer centromere locations in 14 yeast species. Across all microbes that we consider, Centurion predicts 89% of centromeres within 5 kb of their known locations. We also demonstrate the robustness of the approach in datasets with low sequencing depth. Finally, we predict centromere coordinates for six yeast species that currently lack centromere annotations. These results show that Centurion can be used for centromere identification for diverse species of yeast and possibly other microorganisms.
|
Validation Study |
10 |
48 |
7
|
Ay F, Bunnik EM, Varoquaux N, Vert JP, Noble WS, Le Roch KG. Multiple dimensions of epigenetic gene regulation in the malaria parasite Plasmodium falciparum: gene regulation via histone modifications, nucleosome positioning and nuclear architecture in P. falciparum. Bioessays 2014; 37:182-94. [PMID: 25394267 DOI: 10.1002/bies.201400145] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Indexed: 11/12/2022]
Abstract
Plasmodium falciparum is the most deadly human malarial parasite, responsible for an estimated 207 million cases of disease and 627,000 deaths in 2012. Recent studies reveal that the parasite actively regulates a large fraction of its genes throughout its replicative cycle inside human red blood cells and that epigenetics plays an important role in this precise gene regulation. Here, we discuss recent advances in our understanding of three aspects of epigenetic regulation in P. falciparum: changes in histone modifications, nucleosome occupancy and the three-dimensional genome structure. We compare these three aspects of the P. falciparum epigenome to those of other eukaryotes, and show that large-scale compartmentalization is particularly important in determining histone decomposition and gene regulation in P. falciparum. We conclude by presenting a gene regulation model for P. falciparum that combines the described epigenetic factors, and by discussing the implications of this model for the future of malaria research.
|
Research Support, Non-U.S. Gov't |
11 |
42 |
8
|
Ay F, Vu TH, Zeitz MJ, Varoquaux N, Carette JE, Vert JP, Hoffman AR, Noble WS. Identifying multi-locus chromatin contacts in human cells using tethered multiple 3C. BMC Genomics 2015; 16:121. [PMID: 25887659 PMCID: PMC4369351 DOI: 10.1186/s12864-015-1236-7] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 10/10/2014] [Accepted: 01/12/2015] [Indexed: 12/02/2022] Open
Abstract
Background: Several recently developed experimental methods, each an extension of the chromatin conformation capture (3C) assay, have enabled the genome-wide profiling of chromatin contacts between pairs of genomic loci in 3D. Especially in complex eukaryotes, data generated by these methods, coupled with other genome-wide datasets, demonstrated that non-random chromatin folding correlates strongly with cellular processes such as gene expression and DNA replication.
Results: We describe a genome architecture assay, tethered multiple 3C (TM3C), that maps genome-wide chromatin contacts via a simple protocol of restriction enzyme digestion and religation of fragments upon agarose gel beads followed by paired-end sequencing. In addition to identifying contacts between pairs of loci, TM3C enables identification of contacts among more than two loci simultaneously. We use TM3C to assay the genome architectures of two human cell lines: KBM7, a near-haploid chronic leukemia cell line, and NHEK, a normal diploid human epidermal keratinocyte cell line. We confirm that the contact frequency maps produced by TM3C exhibit features characteristic of existing genome architecture datasets, including the expected scaling of contact probabilities with genomic distance, megabase scale chromosomal compartments and sub-megabase scale topological domains. We also confirm that TM3C captures several known cell type-specific contacts, ploidy shifts and translocations, such as Philadelphia chromosome formation (Ph+) in KBM7. We confirm a subset of the triple contacts involving the IGF2-H19 imprinting control region (ICR) using PCR analysis for KBM7 cells. Our genome-wide analysis of pairwise and triple contacts demonstrates their preference for linking open chromatin regions to each other and for linking regions with higher numbers of DNase hypersensitive sites (DHSs) to each other. For near-haploid KBM7 cells, we infer whole genome 3D models that exhibit clustering of small chromosomes with each other and large chromosomes with each other, consistent with previous studies of the genome architectures of other human cell lines.
Conclusion: TM3C is a simple protocol for ascertaining genome architecture and can be used to identify simultaneous contacts among three or four loci. Application of TM3C to a near-haploid human cell line revealed large-scale features of chromosomal organization and multi-way chromatin contacts that preferentially link regions of open chromatin.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1236-7) contains supplementary material, which is available to authorized users.
|
Research Support, U.S. Gov't, Non-P.H.S. |
10 |
41 |
9
|
Lioy VS, Lorenzi JN, Najah S, Poinsignon T, Leh H, Saulnier C, Aigle B, Lautru S, Thibessard A, Lespinet O, Leblond P, Jaszczyszyn Y, Gorrichon K, Varoquaux N, Junier I, Boccard F, Pernodet JL, Bury-Moné S. Dynamics of the compartmentalized Streptomyces chromosome during metabolic differentiation. Nat Commun 2021; 12:5221. [PMID: 34471117 PMCID: PMC8410849 DOI: 10.1038/s41467-021-25462-1] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 01/12/2021] [Accepted: 07/21/2021] [Indexed: 02/07/2023] Open
Abstract
Bacteria of the genus Streptomyces are prolific producers of specialized metabolites, including antibiotics. The linear chromosome includes a central region harboring core genes, as well as extremities enriched in specialized metabolite biosynthetic gene clusters. Here, we show that chromosome structure in Streptomyces ambofaciens correlates with genetic compartmentalization during exponential phase. Conserved, large and highly transcribed genes form boundaries that segment the central part of the chromosome into domains, whereas the terminal ends tend to be transcriptionally quiescent compartments with different structural features. The onset of metabolic differentiation is accompanied by a rearrangement of chromosome architecture, from a rather 'open' to a 'closed' conformation, in which highly expressed specialized metabolite biosynthetic genes form new boundaries. Thus, our results indicate that the linear chromosome of S. ambofaciens is partitioned into structurally distinct entities, suggesting a link between chromosome folding, gene expression and genome evolution.
|
research-article |
4 |
35 |
10
|
Servant N, Varoquaux N, Heard E, Barillot E, Vert JP. Effective normalization for copy number variation in Hi-C data. BMC Bioinformatics 2018; 19:313. [PMID: 30189838 PMCID: PMC6127909 DOI: 10.1186/s12859-018-2256-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 08/22/2017] [Accepted: 06/20/2018] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND: Normalization is essential to ensure accurate analysis and proper interpretation of sequencing data, and chromosome conformation capture data such as Hi-C have particular challenges. Although several methods have been proposed, the most widely used type of normalization of Hi-C data usually casts estimation of unwanted effects as a matrix balancing problem, relying on the assumption that all genomic regions interact equally with each other.
RESULTS: In order to explore the effect of copy-number variations on Hi-C data normalization, we first propose a simulation model that predicts the effects of large copy-number changes on a diploid Hi-C contact map. We then show that the standard approaches relying on equal visibility fail to correct for unwanted effects in the presence of copy-number variations. We thus propose a simple extension to matrix balancing methods that models these effects. Our approach can either retain the copy-number variation effects (LOIC) or remove them (CAIC). We show that this leads to better downstream analysis of the three-dimensional organization of rearranged genomes.
CONCLUSIONS: Taken together, our results highlight the importance of using dedicated methods for the analysis of Hi-C cancer data. Both CAIC and LOIC methods perform well on simulated and real Hi-C data sets, each fulfilling different needs.
|
research-article |
7 |
23 |
11
|
Geiger RS, Varoquaux N, Mazel-Cabasse C, Holdgraf C. The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries. Comput Support Coop Work 2018. [DOI: 10.1007/s10606-018-9333-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 11/27/2022]
|
|
7 |
16 |
12
|
Gao C, Courty PE, Varoquaux N, Cole B, Montoya L, Xu L, Purdom E, Vogel J, Hutmacher RB, Dahlberg JA, Coleman-Derr D, Lemaux PG, Taylor JW. Successional adaptive strategies revealed by correlating arbuscular mycorrhizal fungal abundance with host plant gene expression. Mol Ecol 2022; 32:2674-2687. [PMID: 35000239 DOI: 10.1111/mec.16343] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/24/2021] [Revised: 12/02/2021] [Accepted: 12/23/2021] [Indexed: 11/28/2022]
Abstract
The shifts in adaptive strategies revealed by ecological succession and the mechanisms that facilitate these shifts are fundamental to ecology. These adaptive strategies could be particularly important in communities of arbuscular mycorrhizal fungi (AMF) mutualistic with sorghum where strong AMF succession replaces initially ruderal species with competitive ones and where the strongest plant response to drought is to manage these AMF. Although most studies of agriculturally important fungi focus on parasites, the mutualistic symbionts, AMF, constitute a research system of human-associated fungi whose relative simplicity and synchrony are conducive to experimental ecology. First, we hypothesize that, when irrigation is stopped to mimic drought, competitive AMF species should be replaced by AMF species tolerant to drought stress. We then, for the first time, correlate AMF abundance and host plant transcription to test two novel hypotheses about the mechanisms behind the shift from ruderal to competitive AMF. Surprisingly, despite imposing drought stress, we found no stress tolerant AMF, likely due to our agricultural system having been irrigated for nearly six decades. Remarkably, we found strong and differential correlation between the successional shift from ruderal to competitive AMF and sorghum genes whose products (i) produce and release strigolactone signals, (ii) perceive mycorrhizal-lipochitinoligosaccharide (Myc-LCO) signals, (iii) provide plant lipid and sugar to AMF and, (iv) import minerals and water provided by AMF. These novel insights frame new hypotheses about AMF adaptive evolution and suggest a rationale for selecting AMF to reduce inputs and maximize yields in commercial agriculture.
|
|
3 |
8 |
13
|
Varoquaux N, Noble WS, Vert JP. Inference of 3D genome architecture by modeling overdispersion of Hi-C data. Bioinformatics 2023; 39:btac838. [PMID: 36594573 PMCID: PMC9857972 DOI: 10.1093/bioinformatics/btac838] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/25/2022] [Revised: 11/16/2022] [Indexed: 01/04/2023] Open
Abstract
MOTIVATION: We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data.
RESULTS: We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions.
AVAILABILITY AND IMPLEMENTATION: A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
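The difference between the two count models in this abstract can be made concrete with a short sketch. It assumes the common mean/dispersion parametrization of the negative binomial in which the variance is `mu + mu**2 / r`; the abstract does not state which parametrization Pastis-NB uses, and the function names below are hypothetical.

```python
import math


def poisson_log_pmf(c, mu):
    """Log-probability of count c under a Poisson with mean mu."""
    return c * math.log(mu) - mu - math.lgamma(c + 1)


def nb_log_pmf(c, mu, r):
    """Log-probability of count c under a negative binomial.

    mu is the mean and r > 0 the dispersion (size) parameter, an
    assumed parametrization; smaller r means more overdispersion.
    """
    return (
        math.lgamma(c + r) - math.lgamma(r) - math.lgamma(c + 1)
        + r * math.log(r / (r + mu))
        + c * math.log(mu / (r + mu))
    )


def nb_variance(mu, r):
    """Variance under the same parametrization: mean plus an extra term."""
    return mu + mu * mu / r


# A count far above the mean is much less surprising under the
# negative binomial than under the Poisson with the same mean.
log_p_poisson = poisson_log_pmf(40, 10.0)
log_p_nb = nb_log_pmf(40, 10.0, 5.0)
```

With a mean of 10 and `r = 5`, the variance is 30 rather than 10, and the negative binomial assigns a much higher probability to a count of 40. As `r` grows, the negative binomial approaches the Poisson, so the Poisson model is recovered as a limiting case.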
|
Research Support, N.I.H., Extramural |
2 |
7 |
14
|
Kazemzadeh K, Pelosi L, Chenal C, Chobert SC, Hajj Chehade M, Jullien M, Flandrin L, Schmitt W, He Q, Bouvet E, Jarzynka M, Varoquaux N, Junier I, Pierrel F, Abby SS. Diversification of Ubiquinone Biosynthesis via Gene Duplications, Transfers, Losses, and Parallel Evolution. Mol Biol Evol 2023; 40:msad219. [PMID: 37788637 PMCID: PMC10597321 DOI: 10.1093/molbev/msad219] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/26/2023] [Revised: 09/11/2023] [Accepted: 09/26/2023] [Indexed: 10/05/2023] Open
Abstract
The availability of an ever-increasing diversity of prokaryotic genomes and metagenomes represents a major opportunity to understand and decipher the mechanisms behind the functional diversification of microbial biosynthetic pathways. However, it remains unclear to what extent a pathway producing a specific molecule from a specific precursor can diversify. In this study, we focus on the biosynthesis of ubiquinone (UQ), a crucial coenzyme that is central to the bioenergetics and to the functioning of a wide variety of enzymes in Eukarya and Pseudomonadota (a subgroup of the formerly named Proteobacteria). UQ biosynthesis involves three hydroxylation reactions on contiguous carbon atoms. We and others have previously shown that these reactions are catalyzed by different sets of UQ-hydroxylases that belong either to the iron-dependent Coq7 family or to the more widespread flavin monooxygenase (FMO) family. Here, we combine an experimental approach with comparative genomics and phylogenetics to reveal how UQ-hydroxylases evolved different selectivities within the constrained framework of the UQ pathway. It is shown that the UQ-FMOs diversified via at least three duplication events associated with two cases of neofunctionalization and one case of subfunctionalization, leading to six subfamilies with distinct hydroxylation selectivity. We also demonstrate multiple transfers of the UbiM enzyme and the convergent evolution of UQ-FMOs toward the same function, which resulted in two independent losses of the Coq7 ancestral enzyme. Diversification of this crucial biosynthetic pathway has therefore occurred via a combination of parallel evolution, gene duplications, transfers, and losses.
|
research-article |
2 |
5 |
15
|
Varoquaux N, Lioy VS, Boccard F, Junier I. Computational Tools for the Multiscale Analysis of Hi-C Data in Bacterial Chromosomes. Methods Mol Biol 2022; 2301:197-207. [PMID: 34415537 DOI: 10.1007/978-1-0716-1390-0_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Just as in eukaryotes, high-throughput chromosome conformation capture (Hi-C) data have revealed nested organizations of bacterial chromosomes into overlapping interaction domains. In this chapter, we present a multiscale analysis framework that aims to capture and quantify these properties. The framework includes both standard tools (e.g., contact laws) and novel ones, such as an index that identifies loci involved in domain formation independently of the structuring scale at play. Our objective is twofold. On the one hand, we aim to provide complete, understandable Python/Jupyter-based code that can be used by both computer scientists and biologists with no advanced computational background. On the other hand, we discuss statistical issues inherent to Hi-C data analysis, focusing in particular on how to properly assess the statistical significance of results. As a pedagogical example, we analyze data produced in Pseudomonas aeruginosa, a model pathogenic bacterium. All files (code and input data) can be found in a GitHub repository. We have also embedded the files into a Binder package so that the full analysis can be run on any machine over the Internet.
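A contact law, one of the standard tools named in this abstract, is the mean contact frequency as a function of genomic separation. The sketch below assumes a dense square matrix of binned contacts and a circular chromosome, so separations wrap around; the function name `contact_law` is hypothetical and this is not the chapter's own code.

```python
def contact_law(matrix, circular=True):
    """Mean contact frequency for each genomic separation (in bins).

    matrix: square list of lists of contact values.
    circular: if True, separation is the shorter way around the
    chromosome, an assumption suited to circular bacterial genomes.
    """
    n = len(matrix)
    totals, n_pairs = {}, {}
    for i in range(n):
        for j in range(i + 1, n):
            s = j - i
            if circular:
                s = min(s, n - s)
            totals[s] = totals.get(s, 0.0) + matrix[i][j]
            n_pairs[s] = n_pairs.get(s, 0) + 1
    return {s: totals[s] / n_pairs[s] for s in sorted(totals)}


def circular_separation(i, j, n):
    """Separation between bins i and j on a circle of n bins."""
    return min(abs(i - j), n - abs(i - j))


# Synthetic 6-bin circular map in which contacts decay as 1 / (1 + s).
n_bins = 6
toy = [
    [1.0 / (1 + circular_separation(i, j, n_bins)) for j in range(n_bins)]
    for i in range(n_bins)
]
law = contact_law(toy)
```

On this synthetic map the function recovers the decay it was built from, with mean contact 1/2 at one bin of separation, 1/3 at two and 1/4 at three. On real data the curve is usually inspected on log-log axes.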
|
|
3 |
1 |
16
|
Scavuzzo-Duggan T, Varoquaux N, Madera M, Vogel JP, Dahlberg J, Hutmacher R, Belcher M, Ortega J, Coleman-Derr D, Lemaux P, Purdom E, Scheller HV. Cell Wall Compositions of Sorghum bicolor Leaves and Roots Remain Relatively Constant Under Drought Conditions. Front Plant Sci 2021; 12:747225. [PMID: 34868130 PMCID: PMC8632824 DOI: 10.3389/fpls.2021.747225] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/25/2021] [Accepted: 10/15/2021] [Indexed: 06/13/2023]
Abstract
Renewable fuels are needed to replace fossil fuels in the immediate future. Lignocellulosic bioenergy crops provide a renewable alternative that sequesters atmospheric carbon. To prevent displacement of food crops, it would be advantageous to grow biofuel crops on marginal lands. These lands will likely face more frequent and extreme drought conditions than conventional agricultural land, so it is crucial to see how proposed bioenergy crops fare under these conditions and how that may affect lignocellulosic biomass composition and saccharification properties. We found that while drought impacts the plant cell wall of Sorghum bicolor differently according to tissue and timing of drought induction, drought-induced cell wall compositional modifications are relatively minor and produce no negative effect on biomass conversion. This contrasts with the cell wall-related transcriptome, which had a varied range of highly variable genes (HVGs) within four cell wall-related GO categories, depending on the tissues surveyed and time of drought induction. Further, many HVGs had expression changes in which putative impacts were not seen in the physical cell wall or which were in opposition to their putative impacts. Interestingly, most pre-flowering drought-induced cell wall changes occurred in the leaf, with matrix and lignin compositional changes that did not persist after recovery from drought. Most measurable physical post-flowering cell wall changes occurred in the root, affecting mainly polysaccharide composition and cross-linking. This study couples transcriptomics to cell wall chemical analyses of a C4 grass experiencing progressive and differing drought stresses in the field. As such, we can analyze the cell wall-specific response to agriculturally relevant drought stresses on the transcriptomic level and see whether those changes translate to compositional or biomass conversion differences. 
Our results bolster the conclusion that drought stress does not substantially affect the cell wall composition of specific aerial and subterranean biomass nor impede enzymatic hydrolysis of leaf biomass, a positive result for biorefinery processes. Coupled with previously reported results on the root microbiome and rhizosphere and whole transcriptome analyses of this study, we can formulate and test hypotheses on individual gene candidates' function in mediating drought stress in the grass cell wall, as demonstrated in sorghum.
|
research-article |
4 |
1 |
17
|
Etourneau L, Varoquaux N, Burger T. Unveiling the Links Between Peptide Identification and Differential Analysis FDR Controls by Means of a Practical Introduction to Knockoff Filters. Methods Mol Biol 2023; 2426:1-24. [PMID: 36308682 DOI: 10.1007/978-1-0716-1967-4_1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
In proteomic differential analysis, FDR control is often performed through a multiple test correction (i.e., the adjustment of the original p-values). In this protocol, we apply a recent and alternative method, based on so-called knockoff filters. It shares interesting conceptual similarities with the target-decoy competition procedure, classically used in proteomics for FDR control at peptide identification. To provide practitioners with a unified understanding of FDR control in proteomics, we apply the knockoff procedure to real and simulated quantitative datasets. Leveraging these comparisons, we propose to adapt the knockoff procedure to better fit the specificities of quantitative proteomic data (mainly very few samples). The performance of the knockoff procedure is compared with that of the classical Benjamini-Hochberg procedure, thereby shedding new light on the strengths and weaknesses of target-decoy competition.
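As a point of reference for the multiple test correction mentioned in this abstract, the classical Benjamini-Hochberg step-up procedure can be sketched in a few lines. This is a minimal illustration and not code from the cited protocol; the function name `benjamini_hochberg`, the default `alpha`, and the example p-values are assumptions chosen for the sketch.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if the hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up rule; the name and
    signature are hypothetical, not taken from the cited protocol.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose p-value ranks at or below max_k,
    # even if it failed its own threshold (this is the "step-up" part).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, alpha=0.05))
```

The rule rejects all hypotheses up to the largest rank that passes its threshold, which is why a small p-value can be rejected even when it misses its own rank-specific cutoff.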
|
|
2 |
1 |
18
|
Chen P, Yu Q, Wang C, Montoya L, West PT, Xu L, Varoquaux N, Cole B, Hixson KK, Kim YM, Liu L, Zhang B, Zhang J, Li B, Purdom E, Vogel J, Jansson C, Hutmacher RB, Dahlberg JA, Coleman-Derr D, Lemaux PG, Taylor JW, Gao C. Holo-omics disentangle drought response and biotic interactions among plant, endophyte and pathogen. THE NEW PHYTOLOGIST 2025; 246:2702-2717. [PMID: 40247824 DOI: 10.1111/nph.70155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2024] [Accepted: 03/30/2025] [Indexed: 04/19/2025]
Abstract
Holo-omics provide a novel opportunity to study the interactions among fungi from different functional guilds in host plants in field conditions. We address the entangled responses of plant pathogenic and endophytic fungi associated with sorghum when droughted through the assembly of the most abundant fungal endophyte genome from rhizospheric metagenomic sequences followed by a comparison of its metatranscriptome with the host plant metabolome and transcriptome. The rise in relative abundance of endophytic Acremonium persicinum (operational taxonomic unit 5 (OTU5)) in drought co-occurs with a rise in fungal membrane dynamics and plant metabolites, led by ethanolamine, a key phospholipid membrane component. The negative association between endophytic A. persicinum (OTU5) and plant pathogenic fungi co-occurs with a rise in expression of the endophyte's biosynthetic gene clusters coding for secondary compounds. Endophytic A. persicinum (OTU5) and plant pathogenic fungi are negatively associated under preflowering drought but not under postflowering drought, likely a consequence of variation in fungal fitness responses to changes in the availability of water and niche space caused by plant maturation over the growing season. Our findings suggest that the dynamic biotic interactions among host, beneficial and harmful microbiota in a changing environment can be disentangled by a blending of field observation, laboratory validation, holo-omics and ecological modelling.
|
|
1 |
|
19
|
Varoquaux N. Unfolding the Genome: The Case Study of P. falciparum. Int J Biostat 2018; 15:ijb-2017-0061. [PMID: 29878883 DOI: 10.1515/ijb-2017-0061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2017] [Accepted: 05/10/2018] [Indexed: 11/15/2022]
Abstract
The development of new ways to probe samples for the three-dimensional (3D) structure of DNA paves the way for in-depth and systematic analyses of the genome architecture. 3C-like methods coupled with high-throughput sequencing can now assess physical interactions between pairs of loci in a genome-wide fashion, thus enabling the creation of genome-by-genome contact maps. The spreading of such protocols creates many new opportunities for methodological development: how can we infer 3D models from these contact maps? Can such models help us gain insights into biological processes? Several recent studies applied such protocols to P. falciparum (the deadliest of the five human malaria parasites), assessing its genome organization at different moments of its life cycle. With its small genomic size, fairly simple (yet changing) genomic organization during its life cycle and strong correlation between chromatin folding and gene expression, this parasite is the ideal case study for applying and developing methods to infer 3D models and use them for downstream analysis. Here, I review a set of methods used to build and analyse three-dimensional models from contact map data with a special highlight on P. falciparum's genome organization.
|
|
7 |
|
20
|
Etourneau L, Fancello L, Wieczorek S, Varoquaux N, Burger T. Penalized likelihood optimization for censored missing value imputation in proteomics. Biostatistics 2024; 26:kxaf006. [PMID: 40120089 DOI: 10.1093/biostatistics/kxaf006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 01/31/2025] [Accepted: 02/03/2025] [Indexed: 03/25/2025] Open
Abstract
Label-free bottom-up proteomics using mass spectrometry and liquid chromatography has long been established as one of the most popular high-throughput analysis workflows for proteome characterization. However, it produces data hindered by complex and heterogeneous missing values, whose imputation has long remained problematic. To cope with this, we introduce Pirat, an algorithm that addresses this challenge using an original likelihood maximization strategy. Notably, it models the instrument limit by learning a global censoring mechanism from the data available. Moreover, it estimates the covariance matrix between enzymatic cleavage products (i.e., peptides or precursor ions), while offering a natural way to integrate complementary transcriptomic information when multi-omic assays are available. Our benchmarking on several datasets covering a variety of experimental designs (number of samples, acquisition mode, missingness patterns, etc.) and using a variety of metrics (differential analysis ground truth or imputation errors) shows that Pirat outperforms all pre-existing imputation methods. Beyond the interest of Pirat as an imputation tool, these results pinpoint the need for a paradigm change in proteomics imputation, as most pre-existing strategies could be boosted by incorporating similar models to account for the instrument censorship or for the correlation structures, either grounded to the analytical pipeline or arising from a multi-omic approach.
|
|
1 |
|
21
|
Chobert SC, Roger-Margueritat M, Flandrin L, Berraies S, Lefèvre CT, Pelosi L, Junier I, Varoquaux N, Pierrel F, Abby SS. Dynamic quinone repertoire accompanied the diversification of energy metabolism in Pseudomonadota. THE ISME JOURNAL 2025; 19:wrae253. [PMID: 39693360 PMCID: PMC11707229 DOI: 10.1093/ismejo/wrae253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 10/27/2024] [Accepted: 12/17/2024] [Indexed: 12/20/2024]
Abstract
It is currently unclear how Pseudomonadota, a phylum that originated around the time of the Great Oxidation Event, became one of the most abundant and diverse bacterial phyla on Earth, with metabolically versatile members colonizing a wide range of environments with different O2 concentrations. Here, we address this question by studying isoprenoid quinones, which are central components of energy metabolism covering a wide range of redox potentials. We demonstrate that a dynamic repertoire of quinone biosynthetic pathways accompanied the diversification of Pseudomonadota. The low potential menaquinone (MK) was lost in an ancestor of Pseudomonadota while the high potential ubiquinone (UQ) emerged. We show that the O2-dependent and O2-independent UQ pathways were both present in the last common ancestor of Pseudomonadota, and transmitted vertically. The O2-independent pathway has a conserved genetic organization and displays signs of positive regulation by the master regulator "fumarate and nitrate reductase" (FNR), suggesting a conserved role for UQ in anaerobiosis across Pseudomonadota. The O2-independent pathway was lost in some lineages but maintained in others, where it favoured a secondary reacquisition of low potential quinones (MK or rhodoquinone), which promoted diversification towards aerobic facultative and anaerobic metabolisms. Our results support that the ecological success of Pseudomonadota is linked to the acquisition of the largest known repertoire of quinones, which allowed adaptation to oxic niches as O2 levels increased on Earth, and subsequent diversification into anoxic or O2-fluctuating environments.
|
research-article |
1 |
|
22
|
Paxton A, Varoquaux N, Holdgraf C, Geiger RS. Community, Time, and (Con)text: A Dynamical Systems Analysis of Online Communication and Community Health among Open-Source Software Communities. Cogn Sci 2022; 46:e13134. [PMID: 35579857 PMCID: PMC9287033 DOI: 10.1111/cogs.13134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2020] [Revised: 02/24/2022] [Accepted: 03/06/2022] [Indexed: 11/28/2022]
Abstract
Free and open‐source software projects have become essential digital infrastructure over the past decade. These projects are largely created and maintained by unpaid volunteers, presenting a potential vulnerability if the projects cannot recruit and retain new volunteers. At the same time, their development on open collaborative development platforms provides a nearly complete record of the community's interactions; this affords the opportunity to study naturally occurring language dynamics at scale and in a context with massive real‐world impact. The present work takes a dynamical systems view of language to understand the ways in which communicative context and community membership shape the emergence and impact of language use—specifically, sentiment and expressions of gratitude. We then present evidence that these language dynamics shape newcomers' likelihood of returning, although the specific impacts of different community responses are crucially modulated by the context of the newcomer's first contact with the community.
|
|
3 |
|