1. Ribeiro M, Azevedo L, Santos AP, Pinto Leite P, Pereira MJ. Understanding spatiotemporal patterns of COVID-19 incidence in Portugal: A functional data analysis from August 2020 to March 2022. PLoS One 2024; 19:e0297772. [PMID: 38300912 PMCID: PMC10833534 DOI: 10.1371/journal.pone.0297772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 01/12/2024] [Indexed: 02/03/2024] Open
Abstract
During the SARS-CoV-2 pandemic, governments and public health authorities collected massive amounts of data on daily confirmed positive cases and incidence rates. These data sets provide relevant information to develop a scientific understanding of the pandemic's spatiotemporal dynamics. At the same time, there is a lack of comprehensive approaches to describe and classify patterns underlying the dynamics of COVID-19 incidence across regions over time. This seriously constrains the potential benefits for public health authorities to understand spatiotemporal patterns of disease incidence that would allow for better risk communication strategies and improved assessment of mitigation policies efficacy. Within this context, we propose an exploratory statistical tool that combines functional data analysis with unsupervised learning algorithms to extract meaningful information about the main spatiotemporal patterns underlying COVID-19 incidence on mainland Portugal. We focus on the timeframe spanning from August 2020 to March 2022, considering data at the municipality level. First, we describe the temporal evolution of confirmed daily COVID-19 cases by municipality as a function of time, and outline the main temporal patterns of variability using a functional principal component analysis. Then, municipalities are classified according to their spatiotemporal similarities through hierarchical clustering adapted to spatially correlated functional data. Our findings reveal disparities in disease dynamics between northern and coastal municipalities versus those in the southern and hinterland. We also distinguish effects occurring during the 2020-2021 period from those in the 2021-2022 autumn-winter seasons. The results provide proof-of-concept that the proposed approach can be used to detect the main spatiotemporal patterns of disease incidence. The novel approach expands and enhances existing exploratory tools for spatiotemporal analysis of public health data.
Affiliation(s)
- Manuel Ribeiro
  - CERENA, DER, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Leonardo Azevedo
  - CERENA, DER, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- André Peralta Santos
  - Direção de Serviços de Informação e Análise, Direção-Geral da Saúde, Lisbon, Portugal
  - NOVA National School of Public Health, Public Health Research Centre, Universidade NOVA de Lisboa, Lisbon, Portugal
- Pedro Pinto Leite
  - Direção de Serviços de Informação e Análise, Direção-Geral da Saúde, Lisbon, Portugal
- Maria João Pereira
  - CERENA, DER, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
2. Cheng JH, Zheng C, Yamada R, Okada D. Visualization of the landscape of the read alignment shape of ATAC-seq data using Hellinger distance metric. Genes Cells 2024; 29:5-16. [PMID: 37989133 DOI: 10.1111/gtc.13082] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/08/2023] [Revised: 10/25/2023] [Accepted: 10/28/2023] [Indexed: 11/23/2023]
Abstract
Assay for Transposase-Accessible Chromatin using high-throughput sequencing (ATAC-seq) is a popular technique that uses next-generation sequencing to measure chromatin accessibility and identify open chromatin regions. While read alignment shape information of next-generation sequencing data with intensity information has been used in various bioinformatics methods, few studies have focused on pure shape information alone. In this study, we investigated what types of ATAC-seq read alignment shapes are observed for the promoter region and whether the pure shape information was related or unrelated to other gene features. We introduced a novel concept and pipeline for handling the pure shape information of NGS data as probability distributions and quantifying their dissimilarities by information theory. Based on this concept, we demonstrate that the pure shape information of ATAC-seq data is correlated with chromatin openness and some gene characteristics. On the other hand, it is suggested that the pure information of ATAC-seq read alignment shape is unlikely to contain additional information to explain differences in RNA expression. Our study suggests that viewing the read alignment shape of NGS data as probability distributions enables us to capture the characteristics of the genome-wide landscape of such data in a non-parametric manner.
Affiliation(s)
- Jian Hao Cheng
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Cheng Zheng
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Ryo Yamada
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Daigo Okada
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
3. Madrigal P, Deng S, Feng Y, Militi S, Goh KJ, Nibhani R, Grandy R, Osnato A, Ortmann D, Brown S, Pauklin S. Epigenetic and transcriptional regulations prime cell fate before division during human pluripotent stem cell differentiation. Nat Commun 2023; 14:405. [PMID: 36697417 PMCID: PMC9876972 DOI: 10.1038/s41467-023-36116-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/23/2022] [Accepted: 01/17/2023] [Indexed: 01/26/2023] Open
Abstract
Stem cells undergo cellular division during their differentiation to produce daughter cells with a new cellular identity. However, the epigenetic events and molecular mechanisms occurring between consecutive cell divisions have been insufficiently studied due to technical limitations. Here, using the FUCCI reporter we developed a cell-cycle synchronised human pluripotent stem cell (hPSC) differentiation system for uncovering epigenome and transcriptome dynamics during the first two divisions leading to definitive endoderm. We observed that transcription of key differentiation markers occurs before cell division, while chromatin accessibility analyses revealed the early inhibition of alternative cell fates. We found that Activator protein-1 members controlled by p38/MAPK signalling are necessary for inducing endoderm while blocking cell fate shifting toward mesoderm, and that enhancers are rapidly established and decommissioned between different cell divisions. Our study has practical biomedical utility for producing hPSC-derived patient-specific cell types since p38/MAPK induction increased the differentiation efficiency of insulin-producing pancreatic beta-cells.
Affiliation(s)
- Pedro Madrigal
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
  - Wellcome - MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, CB2 0SZ, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Siwei Deng
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Yuliang Feng
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Stefania Militi
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Kim Jee Goh
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
  - The Francis Crick Institute, London, NW1 1AT, UK
- Reshma Nibhani
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Rodrigo Grandy
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Anna Osnato
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Daniel Ortmann
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Stephanie Brown
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Siim Pauklin
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
4. Craig SJ, Kenney AM, Lin J, Paul IM, Birch LL, Savage JS, Marini ME, Chiaromonte F, Reimherr ML, Makova KD. Constructing a polygenic risk score for childhood obesity using functional data analysis. Econometrics and Statistics 2023; 25:66-86. [PMID: 36620476 PMCID: PMC9813976 DOI: 10.1016/j.ecosta.2021.10.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/17/2023]
Abstract
Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain, a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response.
Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes; and shows how FDA approaches can be used as an alternative to the traditional GWAS.
Affiliation(s)
- Sarah J.C. Craig
  - Department of Biology, Penn State University, University Park
  - Center for Medical Genomics, Penn State University, University Park, PA
- Ana M. Kenney
  - Department of Statistics, Penn State University, University Park, PA
- Junli Lin
  - Department of Statistics, Penn State University, University Park, PA
- Ian M. Paul
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Pediatrics, Penn State College of Medicine, Hershey, PA
- Leann L. Birch
  - Department of Foods and Nutrition, University of Georgia, Athens, GA
- Jennifer S. Savage
  - Department of Nutritional Sciences, Penn State University, University Park, PA
  - Center for Childhood Obesity Research, Penn State University, University Park, PA
- Michele E. Marini
  - Center for Childhood Obesity Research, Penn State University, University Park, PA
- Francesca Chiaromonte
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Statistics, Penn State University, University Park, PA
  - EMbeDS, Sant’Anna School of Advanced Studies, Piazza Martiri della Libertà, Pisa, Italy
- Matthew L. Reimherr
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Statistics, Penn State University, University Park, PA
- Kateryna D. Makova
  - Department of Biology, Penn State University, University Park
  - Center for Medical Genomics, Penn State University, University Park, PA
5. He N, Wang W, Fang C, Tan Y, Li L, Hou C. Integration of Count Difference and Curve Similarity in Negative Regulatory Element Detection. Front Genet 2022; 13:818344. [PMID: 35251128 PMCID: PMC8896116 DOI: 10.3389/fgene.2022.818344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Accepted: 01/20/2022] [Indexed: 12/05/2022] Open
Abstract
Negative regulatory elements (NREs) down-regulate gene expression by inhibiting the activities of promoters or enhancers. The repressing activity of NREs can be measured globally by massively parallel reporter assays (MPRAs). However, most existing algorithms are designed for the statistical detection of positively enriched signals in MPRA datasets. To identify reduced signals in MPRA experiments, we designed an NRE identification program, fast-NR, which integrates the count and graphic features of sequenced reads to detect NREs using datasets generated by self-transcribing active regulatory region sequencing (STARR-seq) experiments. Fast-NR identified hundreds of silencers in human K562 cells that can be validated by independent methods.
Affiliation(s)
- Na He
  - Harbin Institute of Technology, Harbin, China
  - Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
  - *Correspondence: Chunhui Hou; Na He
- Wenjing Wang
  - School of Life Science and State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong, Hong Kong SAR, China
- Chao Fang
  - Cancer Centre, Faculty of Health Sciences, University of Macau, Macao, China
- Yongjian Tan
  - Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Li Li
  - Department of Bioinformatics, Huazhong Agricultural University, Wuhan, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, China
- Chunhui Hou
  - Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
  - *Correspondence: Chunhui Hou; Na He
6. Boschi T, Di Iorio J, Testa L, Cremona MA, Chiaromonte F. Functional data analysis characterizes the shapes of the first COVID-19 epidemic wave in Italy. Sci Rep 2021; 11:17054. [PMID: 34462450 PMCID: PMC8405612 DOI: 10.1038/s41598-021-95866-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/19/2020] [Accepted: 07/27/2021] [Indexed: 12/11/2022] Open
Abstract
We investigate patterns of COVID-19 mortality across 20 Italian regions and their association with mobility, positivity, and socio-demographic, infrastructural and environmental covariates. Notwithstanding limitations in accuracy and resolution of the data available from public sources, we pinpoint significant trends exploiting information in curves and shapes with Functional Data Analysis techniques. These depict two starkly different epidemics: an "exponential" one unfolding in Lombardia and the worst hit areas of the north, and a milder, "flat(tened)" one in the rest of the country, including Veneto, where cases appeared concurrently with Lombardia but aggressive testing was implemented early on. We find that mobility and positivity can predict COVID-19 mortality, also when controlling for relevant covariates. Among the latter, primary care appears to mitigate mortality, and contacts in hospitals, schools and workplaces to aggravate it. The techniques we describe could capture additional and potentially sharper signals if applied to richer data.
Affiliation(s)
- Tobia Boschi
  - Dept. of Statistics and Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Jacopo Di Iorio
  - Institute of Economics and EMbeDS, Sant'Anna School of Advanced Studies, 56127, Pisa, Italy
- Lorenzo Testa
  - Institute of Economics and EMbeDS, Sant'Anna School of Advanced Studies, 56127, Pisa, Italy
- Marzia A Cremona
  - Dept. of Statistics and Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
  - Dept. of Operations and Decision Systems, Université Laval, Quebec, G1V 0A6, Canada
  - CHU de Québec - Université Laval Research Center, Quebec, G1V 4G2, Canada
- Francesca Chiaromonte
  - Dept. of Statistics and Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
  - Institute of Economics and EMbeDS, Sant'Anna School of Advanced Studies, 56127, Pisa, Italy
7. Chen D, Cremona MA, Qi Z, Mitra RD, Chiaromonte F, Makova KD. Human L1 Transposition Dynamics Unraveled with Functional Data Analysis. Mol Biol Evol 2021; 37:3576-3600. [PMID: 32722770 DOI: 10.1093/molbev/msaa194] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022] Open
Abstract
Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features (proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.) in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection, depleted of genes and noncoding most conserved elements. Intriguingly, our results suggest that L1 insertions modify local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.
Affiliation(s)
- Di Chen
  - Intercollege Graduate Degree Program in Genetics, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, PA
  - Department of Operations and Decision Systems, Université Laval, Québec, Canada
- Zongtai Qi
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Robi D Mitra
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA
  - EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy
  - The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
- Kateryna D Makova
  - The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
  - Department of Biology, The Pennsylvania State University, University Park, PA
8. Guiblet WM, Cremona MA, Harris RS, Chen D, Eckert KA, Chiaromonte F, Huang YF, Makova KD. Non-B DNA: a major contributor to small- and large-scale variation in nucleotide substitution frequencies across the genome. Nucleic Acids Res 2021; 49:1497-1516. [PMID: 33450015 PMCID: PMC7897504 DOI: 10.1093/nar/gkaa1269] [Citation(s) in RCA: 58] [Impact Index Per Article: 19.3] [Received: 06/05/2020] [Revised: 12/14/2020] [Accepted: 01/11/2021] [Indexed: 12/12/2022] Open
Abstract
Approximately 13% of the human genome can fold into non-canonical (non-B) DNA structures (e.g. G-quadruplexes, Z-DNA, etc.), which have been implicated in vital cellular processes. Non-B DNA also hinders replication, increasing errors and facilitating mutagenesis, yet its contribution to genome-wide variation in mutation rates remains unexplored. Here, we conducted a comprehensive analysis of nucleotide substitution frequencies at non-B DNA loci within noncoding, non-repetitive genome regions, their ±2 kb flanking regions, and 1-Megabase windows, using human-orangutan divergence and human single-nucleotide polymorphisms. Functional data analysis at single-base resolution demonstrated that substitution frequencies are usually elevated at non-B DNA, with patterns specific to each non-B DNA type. Mirror, direct and inverted repeats have higher substitution frequencies in spacers than in repeat arms, whereas G-quadruplexes, particularly stable ones, have higher substitution frequencies in loops than in stems. Several non-B DNA types also affect substitution frequencies in their flanking regions. Finally, non-B DNA explains more variation than any other predictor in multiple regression models for diversity or divergence at 1-Megabase scale. Thus, non-B DNA substantially contributes to variation in substitution frequencies at small and large scales. Our results highlight the role of non-B DNA in germline mutagenesis with implications to evolution and genetic diseases.
Affiliation(s)
- Wilfried M Guiblet
  - Bioinformatics and Genomics Graduate Program, Penn State University, University Park, PA 16802, USA
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
  - Department of Operations and Decision Systems, Université Laval, Canada
  - CHU de Québec – Université Laval Research Center, Canada
- Robert S Harris
  - Department of Biology, Penn State University, University Park, PA 16802, USA
- Di Chen
  - Intercollege Graduate Degree Program in Genetics, Huck Institutes of the Life Sciences, Penn State University, University Park, PA 16802, USA
- Kristin A Eckert
  - Department of Pathology, Penn State University, College of Medicine, Hershey, PA 17033, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
  - EMbeDS, Sant’Anna School of Advanced Studies, 56127 Pisa, Italy
- Yi-Fei Huang
  - Department of Biology, Penn State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
- Kateryna D Makova
  - Department of Biology, Penn State University, University Park, PA 16802, USA
  - Center for Medical Genomics, Penn State University, University Park and Hershey, PA, USA
9. Mughal MR, Koch H, Huang J, Chiaromonte F, DeGiorgio M. Learning the properties of adaptive regions with functional data analysis. PLoS Genet 2020; 16:e1008896. [PMID: 32853200 PMCID: PMC7480868 DOI: 10.1371/journal.pgen.1008896] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/13/2019] [Revised: 09/09/2020] [Accepted: 05/29/2020] [Indexed: 12/12/2022] Open
Abstract
Identifying regions of positive selection in genomic data remains a challenge in population genetics. Most current approaches rely on comparing values of summary statistics calculated in windows. We present an approach termed SURFDAWave, which translates measures of genetic diversity calculated in genomic windows to functional data. By transforming our discrete data points to be outputs of continuous functions defined over genomic space, we are able to learn the features of these functions that signify selection. This enables us to confidently identify complex modes of natural selection, including adaptive introgression. We are also able to predict important selection parameters that are responsible for shaping the inferred selection events. By applying our model to human population-genomic data, we recapitulate previously identified regions of selective sweeps, such as OCA2 in Europeans, and predict that its beneficial mutation reached a frequency of 0.02 before it swept 1,802 generations ago, a time when humans were relatively new to Europe. In addition, we identify BNC2 in Europeans as a target of adaptive introgression, and predict that it harbors a beneficial mutation that arose in an archaic human population that split from modern humans within the hypothesized modern human-Neanderthal divergence range.
Affiliation(s)
- Mehreen R. Mughal
  - Bioinformatics and Genomics at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Hillary Koch
  - Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Jinguo Huang
  - Bioinformatics and Genomics at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Francesca Chiaromonte
  - Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Michael DeGiorgio
  - Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida, United States of America