1. Cremona MA, Chiaromonte F. Probabilistic K-means with local alignment for clustering and motif discovery in functional data. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2156522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Affiliation(s)
- Marzia A. Cremona
  - Dept. of Operations and Decision Systems, Université Laval, CHU de Québec – Université Laval Research Center
- Francesca Chiaromonte
  - Dept. of Statistics, The Pennsylvania State University, Inst. of Economics and EMbeDS, Sant’Anna School of Advanced Studies
2. Spencer Chapman M, Ranzoni AM, Myers B, Williams N, Coorens THH, Mitchell E, Butler T, Dawson KJ, Hooks Y, Moore L, Nangalia J, Robinson PS, Yoshida K, Hook E, Campbell PJ, Cvejic A. Lineage tracing of human development through somatic mutations. Nature 2021; 595:85-90. [PMID: 33981037 DOI: 10.1038/s41586-021-03548-6] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 05/29/2020] [Accepted: 04/13/2021] [Indexed: 12/21/2022]
Abstract
The ontogeny of the human haematopoietic system during fetal development has previously been characterized mainly through careful microscopic observations¹. Here we reconstruct a phylogenetic tree of blood development using whole-genome sequencing of 511 single-cell-derived haematopoietic colonies from healthy human fetuses at 8 and 18 weeks after conception, coupled with deep targeted sequencing of tissues of known embryonic origin. We found that, in healthy fetuses, individual haematopoietic progenitors acquire tens of somatic mutations by 18 weeks after conception. We used these mutations as barcodes and timed the divergence of embryonic and extra-embryonic tissues during development, and estimated the number of blood antecedents at different stages of embryonic development. Our data support a hypoblast origin of the extra-embryonic mesoderm and primitive blood in humans.
Affiliation(s)
- Michael Spencer Chapman
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Department of Haematology, Hammersmith Hospital, Imperial College Healthcare NHS Trust, London, UK
  - Department of Haematology, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Anna Maria Ranzoni
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge, UK
- Brynelle Myers
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge, UK
- Emily Mitchell
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Department of Haematology, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
- Luiza Moore
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Department of Histopathology, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Jyoti Nangalia
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge, UK
- Philip S Robinson
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Department of Paediatrics, University of Cambridge, Cambridge, UK
- Elizabeth Hook
  - Department of Histopathology, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Peter J Campbell
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
- Ana Cvejic
  - Wellcome Trust Sanger Institute, Hinxton, UK
  - Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge, UK
3. Antoine-Lorquin A, Arensburger P, Arnaoty A, Asgari S, Batailler M, Beauclair L, Belleannée C, Buisine N, Coustham V, Guyetant S, Helou L, Lecomte T, Pitard B, Stévant I, Bigot Y. Two repeated motifs enriched within some enhancers and origins of replication are bound by SETMAR isoforms in human colon cells. Genomics 2021; 113:1589-1604. [PMID: 33812898 DOI: 10.1016/j.ygeno.2021.03.032] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/04/2020] [Revised: 03/25/2021] [Accepted: 03/30/2021] [Indexed: 11/15/2022]
Abstract
Setmar is a gene specific to simian genomes. The function(s) of its isoforms are poorly understood and their existence in healthy tissues remains to be validated. Here we profiled SETMAR expression and its genome-wide binding landscape in colon tissue. We found isoforms V3 and V6 in healthy and tumour colon tissues as well as in cell lines. In two colorectal cell lines SETMAR binds to several thousand Hsmar1 and MADE1 terminal ends, transposons mostly located in non-genic regions of active chromatin including in enhancers. It also binds to a 12-bp motif similar to an inner motif in Hsmar1 and MADE1 terminal ends. This motif is interspersed throughout the genome and is enriched in GC-rich regions as well as in CpG islands that contain constitutive replication origins. It is also found in enhancers other than those associated with Hsmar1 and MADE1. The roles of SETMAR in gene expression, DNA replication and DNA repair are discussed.
Affiliation(s)
- Peter Arensburger
  - Biological Sciences Department, California State Polytechnic University, Pomona, CA 91768, United States
- Ahmed Arnaoty
  - EA GICC, 7501, CHRU de Tours, 37044 TOURS, Cedex 09, France
- Sassan Asgari
  - School of Biological Sciences, The University of Queensland, Brisbane, QLD 4072, Australia
- Martine Batailler
  - PRC, UMR INRA 0085, CNRS 7247, Centre INRA Val de Loire, 37380 Nouzilly, France
- Linda Beauclair
  - PRC, UMR INRA 0085, CNRS 7247, Centre INRA Val de Loire, 37380 Nouzilly, France
- Nicolas Buisine
  - UMR CNRS 7221, Muséum National d'Histoire Naturelle, 75005 Paris, France
- Serge Guyetant
  - Tumorothèque du CHRU de Tours, 37044 Tours, Cedex, France
- Laura Helou
  - PRC, UMR INRA 0085, CNRS 7247, Centre INRA Val de Loire, 37380 Nouzilly, France
- Bruno Pitard
  - Université de Nantes, CNRS ERL6001, Inserm 1232, CRCINA, F-44000 Nantes, France
- Isabelle Stévant
  - Institut de Génomique Fonctionnelle de Lyon, Univ Lyon, CNRS UMR 5242, Ecole Normale Supérieure de Lyon, Université Claude Bernard Lyon 1, 46 allée d'Italie, 69364 Lyon, France
- Yves Bigot
  - PRC, UMR INRA 0085, CNRS 7247, Centre INRA Val de Loire, 37380 Nouzilly, France
4. Guiblet WM, Cremona MA, Harris RS, Chen D, Eckert KA, Chiaromonte F, Huang YF, Makova KD. Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome. Nucleic Acids Res 2021; 49:1497-1516. [PMID: 33450015 PMCID: PMC7897504 DOI: 10.1093/nar/gkaa1269] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 06/05/2020] [Revised: 12/14/2020] [Accepted: 01/11/2021] [Indexed: 12/12/2022] Open
Abstract
Approximately 13% of the human genome can fold into non-canonical (non-B) DNA structures (e.g. G-quadruplexes, Z-DNA, etc.), which have been implicated in vital cellular processes. Non-B DNA also hinders replication, increasing errors and facilitating mutagenesis, yet its contribution to genome-wide variation in mutation rates remains unexplored. Here, we conducted a comprehensive analysis of nucleotide substitution frequencies at non-B DNA loci within noncoding, non-repetitive genome regions, their ±2 kb flanking regions, and 1-Megabase windows, using human-orangutan divergence and human single-nucleotide polymorphisms. Functional data analysis at single-base resolution demonstrated that substitution frequencies are usually elevated at non-B DNA, with patterns specific to each non-B DNA type. Mirror, direct and inverted repeats have higher substitution frequencies in spacers than in repeat arms, whereas G-quadruplexes, particularly stable ones, have higher substitution frequencies in loops than in stems. Several non-B DNA types also affect substitution frequencies in their flanking regions. Finally, non-B DNA explains more variation than any other predictor in multiple regression models for diversity or divergence at 1-Megabase scale. Thus, non-B DNA substantially contributes to variation in substitution frequencies at small and large scales. Our results highlight the role of non-B DNA in germline mutagenesis, with implications for evolution and genetic diseases.
Affiliation(s)
- Wilfried M Guiblet
  - Bioinformatics and Genomics Graduate Program, Penn State University, University Park, PA 16802, USA
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
  - Department of Operations and Decision Systems, Université Laval, Canada
  - CHU de Québec – Université Laval Research Center, Canada
- Robert S Harris
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Di Chen
  - Intercollege Graduate Degree Program in Genetics, Huck Institutes of the Life Sciences, Penn State University, University Park, PA 16802, USA
- Kristin A Eckert
  - Department of Pathology, Penn State University, College of Medicine, Hershey, PA 17033, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
  - EMbeDS, Sant’Anna School of Advanced Studies, 56127 Pisa, Italy
- Yi-Fei Huang
  - Department of Biology, Penn State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
- Kateryna D Makova
  - Department of Biology, Penn State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
5. Brahme A, Hultén M, Bengtsson C, Hultgren A, Zetterberg A. Radiation-Induced Chromosomal Breaks may be DNA Repair Fragile Sites with Larger-scale Correlations to Eight Double-Strand-Break Related Data Sets over the Human Genome. Radiat Res 2019; 192:562-576. [PMID: 31545677 DOI: 10.1667/rr15424.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
In this work, we compared the genomic distribution of common radiation-induced chromosomal breaks to eight different data sets covering the whole human genome. Sites with a high probability of chromatid breakage after exposure to low and high ionization density radiations were often located inside common and rare fragile sites, indicating that they may be a new and more local type of DNA repair-related fragility. Breaks in specific chromosome bands after acute exposure to oil and benzene also showed strong correlation with these sites and fragile sites. In addition, close correlation was found with cytologically detected chiasma and MLH1 immunofluorescence sites and with the HapMap recombination density distributions. Also of interest, copy number changes occurred predominantly at radiation-induced breaks and fragile sites, at least for breast cancers with poor prognosis, and they decreased weakly but significantly in regions with increasing recombination and CpG density. An increased CpG density is linked to regions of high gene density to secure high-fidelity reproduction and survival. To minimize cancer induction, cancer-related genes are often located in regions of decreased recombination density and/or higher-than-average CpG density. It is compelling that all these data sets were influenced by the cells' handling of double-strand breaks and, more generally, of DNA damage in their genomes. In fact, the DNA repair genes systematically avoid regions with a high recombination density, as they need to be intact to accurately handle repairable DNA lesions.
Affiliation(s)
- Anders Brahme
  - Department of Oncology-Pathology, Karolinska Institutet, Box 260, SE-171 76 Stockholm, Sweden
- Maj Hultén
  - Department of Molecular Medicine and Surgery, Karolinska Institutet, Karolinska University Hospital, S-171 76 Stockholm, Sweden
- Carin Bengtsson
  - Department of Oncology-Pathology, Karolinska Institutet, Box 260, SE-171 76 Stockholm, Sweden
- Andreas Hultgren
  - Department of Oncology-Pathology, Karolinska Institutet, Box 260, SE-171 76 Stockholm, Sweden
- Anders Zetterberg
  - Department of Oncology-Pathology, Karolinska Institutet, Box 260, SE-171 76 Stockholm, Sweden
6. Arneson A, Ernst J. Systematic discovery of conservation states for single-nucleotide annotation of the human genome. Commun Biol 2019; 2:248. [PMID: 31286065 PMCID: PMC6606595 DOI: 10.1038/s42003-019-0488-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 02/06/2019] [Accepted: 05/30/2019] [Indexed: 12/12/2022] Open
Abstract
Comparative genomics sequence data is an important source of information for interpreting genomes. Genome-wide annotations based on this data have largely focused on univariate scores or binary elements of evolutionary constraint. Here we present a complementary whole genome annotation approach, ConsHMM, which applies a multivariate hidden Markov model to learn de novo 'conservation states' based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple species DNA sequence alignment. We applied ConsHMM to a 100-way vertebrate sequence alignment to annotate the human genome at single nucleotide resolution into 100 conservation states. These states have distinct enrichments for other genomic information including gene annotations, chromatin states, repeat families, and bases prioritized by various variant prioritization scores. Constrained elements have distinct heritability partitioning enrichments depending on their conservation state assignment. ConsHMM conservation states are a resource for analyzing genomes and genetic variants.
Affiliation(s)
- Adriana Arneson
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095 USA
- Jason Ernst
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA 90095 USA
  - Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA 90095 USA
7.
Abstract
The 1000 Genomes Project created a valuable, worldwide reference for human genetic variation. Common uses of the 1000 Genomes dataset include genotype imputation supporting Genome-wide Association Studies, mapping expression Quantitative Trait Loci, filtering non-pathogenic variants from exome, whole genome and cancer genome sequencing projects, and genetic analysis of population structure and molecular evolution. In this article, we will highlight some of the multiple ways that the 1000 Genomes data can be and has been utilized for genetic studies.
8. Terekhanova NV, Seplyarskiy VB, Soldatov RA, Bazykin GA. Evolution of Local Mutation Rate and Its Determinants. Mol Biol Evol 2017; 34:1100-1109. [PMID: 28138076 PMCID: PMC5850301 DOI: 10.1093/molbev/msx060] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 12/30/2022] Open
Abstract
Mutation rate varies along the human genome, and part of this variation is explainable by measurable local properties of the DNA molecule. Moreover, mutation rates differ between orthologous genomic regions of different species, but the drivers of this change are unclear. Here, we use data on human divergence from chimpanzee, human rare polymorphism, and human de novo mutations to predict the substitution rate at orthologous regions of non-human mammals. We show that the local mutation rates are very similar between humans and apes, implying that their variation has a strong underlying cryptic component not explainable by the known genomic features. Mutation rates become progressively less similar in more distant species, and these changes are partially explainable by changes in the local genomic features of orthologous regions, most importantly, in the recombination rate. However, they are much more rapid, implying that the cryptic component underlying the mutation rate is more ephemeral than the known genomic features. These findings shed light on the determinants of mutation rate evolution.
Key words: local mutation rate, molecular evolution, recombination rate.
Affiliation(s)
- Nadezhda V. Terekhanova
  - Sector for Molecular Evolution, Institute for Information Transmission Problems of the RAS (Kharkevich Institute), Moscow, Russia
  - M. V. Lomonosov Moscow State University, Moscow, Russia
- Vladimir B. Seplyarskiy
  - Sector for Molecular Evolution, Institute for Information Transmission Problems of the RAS (Kharkevich Institute), Moscow, Russia
- Ruslan A. Soldatov
  - Sector for Molecular Evolution, Institute for Information Transmission Problems of the RAS (Kharkevich Institute), Moscow, Russia
  - M. V. Lomonosov Moscow State University, Moscow, Russia
- Georgii A. Bazykin
  - Sector for Molecular Evolution, Institute for Information Transmission Problems of the RAS (Kharkevich Institute), Moscow, Russia
  - M. V. Lomonosov Moscow State University, Moscow, Russia
  - Skolkovo Institute of Science and Technology, Skolkovo, Russia
9. Bartolucci F, Chiaromonte F, Don PK, Lindsay BG. Composite Likelihood Inference in a Discrete Latent Variable Model for Two-Way “Clustering-by-Segmentation” Problems. J Comput Graph Stat 2017. [DOI: 10.1080/10618600.2016.1172018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, State College, Pennsylvania
- Prabhani Kuruppumullage Don
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, and Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Bruce G. Lindsay
  - Department of Statistics, The Pennsylvania State University, State College, Pennsylvania
10. Liu Y, Chiaromonte F, Li B. Structured Ordinary Least Squares: A Sufficient Dimension Reduction approach for regressions with partitioned predictors and heterogeneous units. Biometrics 2016; 73:529-539. [PMID: 27649087 DOI: 10.1111/biom.12579] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/01/2015] [Revised: 07/01/2016] [Accepted: 07/01/2016] [Indexed: 11/29/2022]
Abstract
In many scientific and engineering fields, advanced experimental and computing technologies are producing data that are not just high dimensional, but also internally structured. For instance, statistical units may have heterogeneous origins from distinct studies or subpopulations, and features may be naturally partitioned based on experimental platforms generating them, or on information available about their roles in a given phenomenon. In a regression analysis, exploiting this known structure in the predictor dimension reduction stage that precedes modeling can be an effective way to integrate diverse data. To pursue this, we propose a novel Sufficient Dimension Reduction (SDR) approach that we call structured Ordinary Least Squares (sOLS). This combines ideas from existing SDR literature to merge reductions performed within groups of samples and/or predictors. In particular, it leads to a version of OLS for grouped predictors that requires far less computation than recently proposed groupwise SDR procedures, and provides an informal yet effective variable selection tool in these settings. We demonstrate the performance of sOLS by simulation and present a first application to genomic data. The R package "sSDR," publicly available on CRAN, includes all procedures necessary to implement the sOLS approach.
Affiliation(s)
- Yang Liu
  - Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, U.S.A
- Francesca Chiaromonte
  - Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, U.S.A
- Bing Li
  - Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, U.S.A
11. Makova KD, Hardison RC. The effects of chromatin organization on variation in mutation rates in the genome. Nat Rev Genet 2015; 16:213-23. [PMID: 25732611 PMCID: PMC4500049 DOI: 10.1038/nrg3890] [Citation(s) in RCA: 160] [Impact Index Per Article: 16.0] [Indexed: 12/12/2022]
Abstract
The variation in local rates of mutations can affect both the evolution of genes and their function in normal and cancer cells. Deciphering the molecular determinants of this variation will be aided by the elucidation of distinct types of mutations, as they differ in regional preferences and in associations with genomic features. Chromatin organization contributes to regional variation in mutation rates, but its contribution differs among mutation types. In both germline and somatic mutations, base substitutions are more abundant in regions of closed chromatin, perhaps reflecting error accumulation late in replication. By contrast, a distinctive mutational state with very high levels of insertions and deletions (indels) and substitutions is enriched in regions of open chromatin. These associations indicate an intricate interplay between the nucleotide sequence of DNA and its dynamic packaging into chromatin, and have important implications for current biomedical research. This Review focuses on recent studies showing associations between chromatin state and mutation rates, including pairwise and multivariate investigations of germline and somatic (particularly cancer) mutations.
Affiliation(s)
- Kateryna D Makova
  - Department of Biology, Huck Institute for Genome Sciences, The Pennsylvania State University, University Park, State College, Pennsylvania 16802, USA
- Ross C Hardison
  - Department of Biochemistry and Molecular Biology, Huck Institute for Genome Sciences, The Pennsylvania State University, University Park, State College, Pennsylvania 16802, USA