1
Gertheiss J, Rügamer D, Liew BXW, Greven S. Functional Data Analysis: An Introduction and Recent Developments. Biom J 2024; 66:e202300363. [PMID: 39330918 DOI: 10.1002/bimj.202300363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 05/17/2024] [Accepted: 05/27/2024] [Indexed: 09/28/2024]
Abstract
Functional data analysis (FDA) is a statistical framework that allows for the analysis of curves, images, or functions on higher dimensional domains. The goals of FDA, such as descriptive analyses, classification, and regression, are generally the same as for statistical analyses of scalar-valued or multivariate data, but FDA brings additional challenges due to the high- and infinite dimensionality of observations and parameters, respectively. This paper provides an introduction to FDA, including a description of the most common statistical analysis techniques, their respective software implementations, and some recent developments in the field. The paper covers fundamental concepts such as descriptives and outliers, smoothing, amplitude and phase variation, and functional principal component analysis. It also discusses functional regression, statistical inference with functional data, functional classification and clustering, and machine learning approaches for functional data analysis. The methods discussed in this paper are widely applicable in fields such as medicine, biophysics, neuroscience, and chemistry and are increasingly relevant due to the widespread use of technologies that allow for the collection of functional data. Sparse functional data methods are also relevant for longitudinal data analysis. All presented methods are demonstrated using available software in R by analyzing a dataset on human motion and motor control. To facilitate the understanding of the methods, their implementation, and hands-on application, the code for these practical examples is made available through a code and data supplement and on GitHub.
Affiliation(s)
- Jan Gertheiss
  - Department of Mathematics and Statistics, School of Economics and Social Sciences, Helmut Schmidt University, Hamburg, Germany
- David Rügamer
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Bernard X W Liew
  - School of Sport, Rehabilitation and Exercise Sciences, University of Essex, Essex, UK
- Sonja Greven
  - Chair of Statistics, School of Business and Economics, Humboldt-Universität zu Berlin, Berlin, Germany
2
Zhu C, Wang JL. Testing homogeneity: the trouble with sparse functional data. J R Stat Soc Series B Stat Methodol 2023; 85:705-731. [PMID: 37521166 PMCID: PMC10376451 DOI: 10.1093/jrsssb/qkad021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Revised: 12/06/2022] [Accepted: 02/25/2023] [Indexed: 08/01/2023]
Abstract
Testing the homogeneity between two samples of functional data is an important task. While this is feasible for intensely measured functional data, we explain why it is challenging for sparsely measured functional data and show what can be done for such data. In particular, we show that testing the marginal homogeneity based on point-wise distributions is feasible under some mild constraints and propose a new two-sample statistic that works well with both intensively and sparsely measured functional data. The proposed test statistic is formulated upon energy distance, and the convergence rate of the test statistic to its population version is derived along with the consistency of the associated permutation test. The aptness of our method is demonstrated on both synthetic and real data sets.
Affiliation(s)
- Changbo Zhu
  - Address for correspondence: Changbo Zhu, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA.
- Jane-Ling Wang
  - Department of Statistics, University of California, Davis, Davis, United States
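As a rough illustration of the kind of procedure the abstract of entry 2 describes (an energy-distance statistic combined with a permutation test), the following Python sketch may help. It is not the authors' statistic: it assumes densely observed curves on a common grid, averages the univariate (V-statistic) energy distance over grid points, and does not handle sparse designs. All function names are hypothetical.

```python
import numpy as np


def energy_distance_1d(x, y):
    """V-statistic energy distance 2E|X-Y| - E|X-X'| - E|Y-Y'| for 1-D samples."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * xy - xx - yy


def integrated_energy_distance(X, Y):
    """Average the pointwise energy distance over a common grid.

    X has shape (n, T) and Y has shape (m, T): one row per curve,
    one column per grid point (an assumption of this sketch).
    """
    values = [energy_distance_1d(X[:, j], Y[:, j]) for j in range(X.shape[1])]
    return float(np.mean(values))


def permutation_test(X, Y, n_perm=500, seed=0):
    """Permutation p-value obtained by reshuffling group labels of whole curves."""
    rng = np.random.default_rng(seed)
    observed = integrated_energy_distance(X, Y)
    pooled = np.vstack([X, Y])
    n = X.shape[0]
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        stat = integrated_energy_distance(pooled[idx[:n]], pooled[idx[n:]])
        count += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(30, 25))        # 30 simulated curves on 25 grid points
    Y = rng.normal(size=(30, 25)) + 0.5  # second group with a mean shift
    stat, p = permutation_test(X, Y, n_perm=200)
    print(f"statistic={stat:.4f}, p-value={p:.4f}")
```

Permuting whole curves rather than individual observations preserves the within-curve dependence, which is why the labels are shuffled row-wise.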
3
Pini A, Sørensen H, Tolver A, Vantini S. Local inference for functional linear mixed models. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2022.107688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
4
Fayaz M. The lock-down effects of COVID-19 on the air pollution indices in Iran and its neighbors. Modeling Earth Systems and Environment 2022; 9:669-675. [PMID: 36157916 PMCID: PMC9483498 DOI: 10.1007/s40808-022-01528-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 09/06/2022] [Indexed: 11/29/2022]
Abstract
Introduction: COVID-19 restrictions have had a variety of peripheral negative and positive effects, such as economic shocks and decreased air pollution, respectively. Many studies have shown NO2 reductions in most parts of the world. Methods: Iran and its land and maritime neighbors account for about 7.4% of the world population and 6.3% and 5.8% of world COVID-19 cases and deaths, respectively. Their air pollution indices, namely CH4 (methane), CO_1 (CO), H2O (water), HCHO (tropospheric atmospheric formaldehyde), NO2 (nitrogen dioxide), O3 (ozone), SO2 (sulfur dioxide), and UVAI_AAI [UV Aerosol Index (UVAI)/Absorbing Aerosol Index (AAI)], are studied from the first quarter of 2019 to the fourth quarter of 2021 using the Copernicus Sentinel-5 Precursor (S5P) satellite data set from Google Earth Engine. Outliers are detected based on depth functions. We use a two-sample t test, the Wilcoxon test, and interval-wise testing for functional data to control the familywise error rate. Results: The adjusted p value comparison between Q2 of 2019 and Q2 of 2020 for NO2 is statistically significant for almost all countries except Iraq, UAE, Bahrain, Qatar, and Kuwait. CO and HCHO are not statistically significant in any country, whereas CH4, O3, and UVAI_AAI are statistically significant for some countries. In the Q2 comparison for NO2 between 2020 and 2021, only Iran, Armenia, Turkey, UAE, and Saudi Arabia are statistically significant. However, CH4 is statistically significant for all countries except Azerbaijan. Conclusions: The comparisons with and without adjusted p values indicate decreases in some air pollutants in these countries. Supplementary Information: The online version contains supplementary material available at 10.1007/s40808-022-01528-x.
Affiliation(s)
- Mohammad Fayaz
  - Department of Biostatistics, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
5
Yi Y, Billor N, Liang M, Cao X, Ekstrom A, Zheng J. Classification of EEG signals: An interpretable approach using functional data analysis. J Neurosci Methods 2022; 376:109609. [PMID: 35483504 DOI: 10.1016/j.jneumeth.2022.109609] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/28/2021] [Revised: 03/25/2022] [Accepted: 04/21/2022] [Indexed: 11/17/2022]
Abstract
Electroencephalography (EEG) is a noninvasive method to record electrical activity of the brain. EEG data are a continuous flow of voltages; in this paper, we treat them as functional data and propose a three-stage algorithm based on functional data analysis, with the advantage of interpretability. Specifically, the time and frequency information is extracted by wavelet transform in the first stage. Then, functional testing is used to select EEG channels and frequencies that show significant differences between human behaviors. In the third stage, we propose to use penalized multiple functional logistic regression to classify human behaviors interpretably. Using simulations and a scalp EEG data set for validation, we show that the proposed three-stage algorithm provides an interpretable classification of the scalp EEG signals.
Affiliation(s)
- Yuyan Yi
  - Department of Mathematics and Statistics, Auburn University, USA.
- Nedret Billor
  - Department of Mathematics and Statistics, Auburn University, USA.
- Mingli Liang
  - Department of Psychiatry, Department of Neurosurgery, Yale University, USA.
- Xuan Cao
  - Department of Mathematical Sciences, University of Cincinnati, USA.
- Arne Ekstrom
  - Department of Psychology, University of Arizona, USA.
- Jingyi Zheng
  - Department of Mathematics and Statistics, Auburn University, USA.
6
Codazzi L, Colombi A, Gianella M, Argiento R, Paci L, Pini A. Gaussian graphical modeling for spectrometric data analysis. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2021.107416] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
7
Chen D, Cremona MA, Qi Z, Mitra RD, Chiaromonte F, Makova KD. Human L1 Transposition Dynamics Unraveled with Functional Data Analysis. Mol Biol Evol 2021; 37:3576-3600. [PMID: 32722770 DOI: 10.1093/molbev/msaa194] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022] Open
Abstract
Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features (proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.) in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection, that is, regions depleted of genes and of the most conserved noncoding elements. Intriguingly, our results suggest that L1 insertions modify the local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.
Affiliation(s)
- Di Chen
  - Intercollege Graduate Degree Program in Genetics, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, PA
  - Department of Operations and Decision Systems, Université Laval, Québec, Canada
- Zongtai Qi
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Robi D Mitra
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA
  - EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy
  - The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
- Kateryna D Makova
  - The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
  - Department of Biology, The Pennsylvania State University, University Park, PA
8
Qiu Z, Chen J, Zhang JT. Two-sample tests for multivariate functional data with applications. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
9
Dannenmaier J, Kaltenbach C, Kölle T, Krischak G. Application of functional data analysis to explore movements: walking, running and jumping - A systematic review. Gait Posture 2020; 77:182-189. [PMID: 32058281 DOI: 10.1016/j.gaitpost.2020.02.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/14/2019] [Revised: 09/24/2019] [Accepted: 02/02/2020] [Indexed: 02/02/2023]
Abstract
Background: Signals are continuously captured during the recording of motion data. Statistical analysis, however, usually uses only a few aspects of the recorded data. Functional data analysis offers the possibility to analyze the entire signal over time. Research question: The review is based on the question of how functional data analysis is used in the study of lower limb movements. Methods: The literature search was based on the databases EMBASE, PUBMED and OVID MEDLINE. All articles on the application of functional data analysis to motion-associated variables (trajectories, ground reaction force, electromyography) were included. The references were assessed independently by two reviewers. Results: In total, 1448 articles were found in the search. Finally, 13 articles were included in the review. All were of moderate methodological quality. The publication year of the studies ranges from 2009 to 2019. Healthy volunteers and persons with cruciate ligament injuries, knee osteoarthritis, gluteal tendinopathy, idiopathic torsional deformities, slipped capital femoral epiphysis and chronic ankle instability were examined in the studies. Movements were analyzed on the basis of kinematics (3D motion analysis), ground reaction forces and electromyography. Functional data analysis was used in terms of landmark registration, functional principal component analysis, functional t-test and functional ANOVA. Significance: Functional data analysis provides the possibility to gain detailed and in-depth insights into the analysis of motion patterns. As a result of the increase in references over the past year, FDA is becoming more important in the analysis of continuous signals and the explorative analysis of movement data.
Affiliation(s)
- Julia Dannenmaier
  - Institute for Research in Rehabilitation Medicine at Ulm University (IFR Ulm), Bad Buchau, Germany
- Christina Kaltenbach
  - Institute for Research in Rehabilitation Medicine at Ulm University (IFR Ulm), Bad Buchau, Germany
- Theresa Kölle
  - Institute for Research in Rehabilitation Medicine at Ulm University (IFR Ulm), Bad Buchau, Germany
- Gert Krischak
  - Institute for Research in Rehabilitation Medicine at Ulm University (IFR Ulm), Bad Buchau, Germany
  - Department of Orthopedics and Orthopedic Surgery, Federseeklinik, Bad Buchau, Germany
10
Cremona MA, Pini A, Cumbo F, Makova KD, Chiaromonte F, Vantini S. IWTomics: testing high-resolution sequence-based 'Omics' data at multiple locations and scales. Bioinformatics 2019; 34:2289-2291. [PMID: 29474526 DOI: 10.1093/bioinformatics/bty090] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/05/2017] [Accepted: 02/20/2018] [Indexed: 11/13/2022] Open
Abstract
Summary: With increased generation of high-resolution sequence-based 'Omics' data, detecting statistically significant effects at different genomic locations and scales has become key to addressing several scientific questions. IWTomics is an R/Bioconductor package (integrated in Galaxy) that, exploiting sophisticated Functional Data Analysis techniques (i.e. statistical techniques that deal with the analysis of curves), allows users to pre-process, visualize and test these data at multiple locations and scales. The package provides a friendly, flexible and complete workflow that can be employed in many genomic and epigenomic applications. Availability and implementation: IWTomics is freely available at the Bioconductor website (http://bioconductor.org/packages/IWTomics) and on the main Galaxy instance (https://usegalaxy.org/). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, USA
- Alessia Pini
  - MOX - Department of Mathematics, Politecnico di Milano, Milano, Italy
- Fabio Cumbo
  - Department of Engineering, Third University of Rome, Italy
  - Institute for Systems Analysis and Computer Science 'Antonio Ruberti', National Research Council of Italy, Rome, Italy
- Kateryna D Makova
  - Center for Medical Genomics, The Huck Institutes of the Life Sciences
  - Department of Biology, The Pennsylvania State University, University Park, USA
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, USA
  - Center for Medical Genomics, The Huck Institutes of the Life Sciences
  - Sant'Anna School of Advanced Studies, Pisa, Italy
- Simone Vantini
  - MOX - Department of Mathematics, Politecnico di Milano, Milano, Italy
11
12
Pini A, Spreafico L, Vantini S, Vietti A. Multi-aspect local inference for functional data: Analysis of ultrasound tongue profiles. J Multivariate Anal 2019. [DOI: 10.1016/j.jmva.2018.11.006] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
13
14
Kraus D, Stefanucci M. Classification of functional fragments by regularized linear classifiers with domain selection. Biometrika 2018. [DOI: 10.1093/biomet/asy060] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Affiliation(s)
- David Kraus
  - Department of Mathematics and Statistics, Masaryk University, Kotlářská 2, Brno, Czech Republic
- Marco Stefanucci
  - Department of Statistical Sciences, Sapienza University of Rome, Piazzale Aldo Moro 5, Roma, Italy
15
16
Sharghi Ghale-Joogh H, Hosseini-Nasab SME. A two-sample test for mean functions with increasing number of projections. STATISTICS-ABINGDON 2018. [DOI: 10.1080/02331888.2018.1472599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
17
Abramowicz K, Häger CK, Pini A, Schelin L, Sjöstedt de Luna S, Vantini S. Nonparametric inference for functional-on-scalar linear models applied to knee kinematic hop data after injury of the anterior cruciate ligament. Scand Stat Theory Appl 2018. [DOI: 10.1111/sjos.12333] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/30/2022]
Affiliation(s)
- Konrad Abramowicz
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden
- Charlotte K. Häger
  - Department of Community Medicine and Rehabilitation, Umeå University, Umeå, Sweden
- Alessia Pini
  - Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, Umeå, Sweden
  - Department of Statistical Sciences, Università Cattolica del Sacro Cuore, Milan, Italy
- Lina Schelin
  - Department of Community Medicine and Rehabilitation, Umeå University, Umeå, Sweden
  - Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, Umeå, Sweden
- Simone Vantini
  - MOX - Modelling and Scientific Computing Laboratory, Department of Mathematics, Politecnico di Milano, Milan, Italy
18
19
Hébert-Losier K, Schelin L, Tengman E, Strong A, Häger CK. Curve analyses reveal altered knee, hip, and trunk kinematics during drop-jumps long after anterior cruciate ligament rupture. Knee 2018. [PMID: 29525548 DOI: 10.1016/j.knee.2017.12.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 02/02/2023]
Abstract
BACKGROUND: Anterior cruciate ligament (ACL) ruptures may lead to knee dysfunctions later in life. Single-leg tasks are often evaluated, but bilateral movements may also be compromised. Our aim was to use curve analyses to examine double-leg drop-jump kinematics in ACL-reconstructed, ACL-deficient, and healthy-knee cohorts. METHODS: Subjects with unilateral ACL ruptures treated more than two decades ago (17-28 years) conservatively with physiotherapy (ACLPT, n=26) or in combination with reconstructive surgery (ACLR, n=28) and healthy-knee controls (n=25) performed 40-cm drop-jumps. Three-dimensional knee, hip, and trunk kinematics were analyzed during Rebound, Flight, and Landing phases. Curves were time-normalized and compared between groups (injured and non-injured legs of ACLPT and ACLR vs. non-dominant and dominant legs of controls) and within groups (between legs) using functional analysis of variance methods. RESULTS: Compared to controls, ACL groups exhibited less knee and hip flexion on both legs during Rebound and greater knee external rotation on their injured leg at the start of Rebound and Landing. ACLR also showed less trunk flexion during Rebound. Between-leg differences were observed in ACLR only, with the injured leg more internally rotated at the hip. Overall, kinematic curves were similar between ACLR and ACLPT. However, compared to controls, deviations spanned a greater proportion of the drop-jump movement at the hip in ACLR and at the knee in ACLPT. CONCLUSIONS: Trunk and bilateral leg kinematics during double-leg drop-jumps are still compromised long after ACL-rupture care, independent of treatment. Curve analyses indicate the presence of distinct compensatory mechanisms in ACLPT and ACLR compared to controls.
Affiliation(s)
- Kim Hébert-Losier
  - The University of Waikato, Faculty of Health, Sport and Human Performance, Adams Centre for High Performance, 52 Miro Street, Mount Maunganui, Tauranga 3116, New Zealand.
- Lina Schelin
  - Umeå University, Department of Statistics, Umeå School of Business and Economics, 901 87 Umeå, Sweden
- Eva Tengman
  - Umeå University, Department of Community Medicine and Rehabilitation Physiotherapy, 901 87 Umeå, Sweden
- Andrew Strong
  - Umeå University, Department of Community Medicine and Rehabilitation Physiotherapy, 901 87 Umeå, Sweden
- Charlotte K Häger
  - Umeå University, Department of Community Medicine and Rehabilitation Physiotherapy, 901 87 Umeå, Sweden
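The abstract of entry 19 states that curves were time-normalized before comparison. A minimal Python sketch of one common way to do this is given below: linear resampling of each recording onto a shared 0-100% axis. The choice of 101 points, the use of linear interpolation, and the function name are assumptions made here for illustration and are not taken from the study.

```python
import numpy as np


def time_normalize(signal, n_points=101):
    """Resample a 1-D recording onto an evenly spaced 0-100% time axis.

    Linear interpolation is an assumption of this sketch; studies may
    use splines or other schemes instead.
    """
    signal = np.asarray(signal, dtype=float)
    old_axis = np.linspace(0.0, 100.0, num=signal.size)
    new_axis = np.linspace(0.0, 100.0, num=n_points)
    return np.interp(new_axis, old_axis, signal)


if __name__ == "__main__":
    # Two hypothetical trials of different duration end up on the same grid.
    short_trial = np.sin(np.linspace(0.0, np.pi, 80))
    long_trial = np.sin(np.linspace(0.0, np.pi, 143))
    curves = np.vstack([time_normalize(short_trial), time_normalize(long_trial)])
    print(curves.shape)
```

Once all trials share the same grid, they can be stacked into a matrix with one row per curve, which is the input layout most pointwise or functional comparison methods expect.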
20
Affiliation(s)
- A. Pini
  - MOX - Department of Mathematics, Politecnico di Milano, Milan, Italy
- S. Vantini
  - MOX - Department of Mathematics, Politecnico di Milano, Milan, Italy
21
Abstract
We congratulate the authors for their excellent work that provides a clear overview of the large and now mature field of regression models for functional data. We here complement their discussion indicating some directions of further research that we deem particularly important.
22
Campos-Sánchez R, Cremona MA, Pini A, Chiaromonte F, Makova KD. Integration and Fixation Preferences of Human and Mouse Endogenous Retroviruses Uncovered with Functional Data Analysis. PLoS Comput Biol 2016; 12:e1004956. [PMID: 27309962 PMCID: PMC4911145 DOI: 10.1371/journal.pcbi.1004956] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 01/15/2016] [Accepted: 04/29/2016] [Indexed: 01/24/2023] Open
Abstract
Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs' integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.
Affiliation(s)
- Rebeca Campos-Sánchez
  - Genetics Graduate Program, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Marzia A. Cremona
  - MOX - Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
  - Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
- Alessia Pini
  - MOX - Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
- Francesca Chiaromonte
  - Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
  - Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Kateryna D. Makova
  - Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
  - Department of Biology, Penn State University, University Park, Pennsylvania, United States of America