1
Barbanti L, Hothorn T. A transformation perspective on marginal and conditional models. Biostatistics 2024; 25:402-428. [PMID: 36534895 DOI: 10.1093/biostatistics/kxac048] [Received: 05/24/2022] [Revised: 11/02/2022] [Accepted: 11/28/2022] [Indexed: 08/04/2023]
Abstract
Clustered observations are ubiquitous in controlled and observational studies and arise naturally in multicenter trials or longitudinal surveys. We present a novel model for the analysis of clustered observations where the marginal distributions are described by a linear transformation model and the correlations by a joint multivariate normal distribution. The joint model provides an analytic formula for the marginal distribution. Owing to the richness of transformation models, the techniques are applicable to any type of response variable, including bounded, skewed, binary, ordinal, or survival responses. We demonstrate how the common normal assumption for reaction times can be relaxed in the sleep deprivation benchmark data set and report marginal odds ratios for the notoriously difficult toenail data. We furthermore discuss the analysis of two clinical trials aiming at the estimation of marginal treatment effects. In the first trial, pain was repeatedly assessed on a bounded visual analog scale and marginal proportional-odds models are presented. The second trial reported disease-free survival in rectal cancer patients, where the marginal hazard ratio from Weibull and Cox models is of special interest. An empirical evaluation compares the performance of the novel approach to generalized estimating equations for binary responses and to conditional mixed-effects models for continuous responses. An implementation is available in the tram add-on package to the R system and was benchmarked against established models in the literature.
Affiliation(s)
- Luisa Barbanti
- Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Hirschengraben 84, CH-8001 Zürich, Switzerland
- Torsten Hothorn
- Institut für Epidemiologie, Biostatistik und Prävention, Universität Zürich, Hirschengraben 84, CH-8001 Zürich, Switzerland
2
Abstract
We assessed several agreement coefficients applied to 2x2 contingency tables, which are common in research because variables are often dichotomized. Here, we not only studied some specific estimators but also developed a general method for studying any estimator that is a candidate agreement measure. This method was implemented in open-source R code and is available to researchers. We tested this method by verifying the performance of several traditional estimators over all possible configurations with sizes ranging from 1 to 68 (total of 1,028,789 tables). Cohen's kappa showed handicapped behavior similar to Pearson's r, Yule's Q, and Yule's Y. Scott's pi and Shankar and Bangdiwala's B seem to assess situations of disagreement better than situations of agreement between raters. Krippendorff's alpha emulates, without any advantage, Scott's pi in cases with nominal variables and two raters. Dice's F1 and McNemar's chi-squared incompletely assess the information in the contingency table, showing the poorest performance of all. We concluded that Cohen's kappa is a measure of association and that McNemar's chi-squared assesses neither association nor agreement; the only two authentic agreement estimators are Holley and Guilford's G and Gwet's AC1. The latter two estimators also showed the best performance over the range of table sizes and should be considered the first choices for agreement measurement in 2x2 contingency tables. All procedures and data were implemented in R and are available to download from Harvard Dataverse https://doi.org/10.7910/DVN/HMYTCK.
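To make the comparison among coefficients concrete, the following is a minimal Python sketch (not the authors' R code) that computes Cohen's kappa, Holley and Guilford's G, and Gwet's AC1 for a single 2x2 table using their standard textbook definitions. The function name `agreement_2x2` and the cell labels `a`, `b`, `c`, `d` are illustrative assumptions, with `a` and `d` holding the counts on which the two raters agree.

```python
import math


def agreement_2x2(a, b, c, d):
    """Agreement coefficients for a 2x2 table (hypothetical helper).

    a: both raters say "yes"      b: rater 1 "yes", rater 2 "no"
    c: rater 1 "no", rater 2 "yes"  d: both raters say "no"
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table must contain at least one observation")

    po = (a + d) / n            # observed proportion of agreement
    row1 = (a + b) / n          # rater 1 marginal proportion of "yes"
    col1 = (a + c) / n          # rater 2 marginal proportion of "yes"

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = row1 * col1 + (1 - row1) * (1 - col1)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else math.nan

    # Holley and Guilford's G: chance agreement fixed at 1/2.
    g = 2 * po - 1

    # Gwet's AC1 (two categories): chance agreement 2 * pi * (1 - pi),
    # where pi is the average marginal proportion of "yes".
    pi = (row1 + col1) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    return {"kappa": kappa, "G": g, "AC1": ac1}


if __name__ == "__main__":
    # Same observed agreement (0.9), very different marginal distributions.
    print(agreement_2x2(40, 5, 5, 50))  # balanced marginals
    print(agreement_2x2(90, 5, 5, 0))   # highly skewed marginals
```

Both example tables have 90% observed agreement, yet kappa drops below zero for the skewed table while G and AC1 stay high, which is one way to see the behavior the abstract attributes to kappa.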
Affiliation(s)
- Paulo Sergio Panse Silveira
- Department of Pathology (LIM01-HCFMUSP), Medical School, University of Sao Paulo, Av. Dr. Arnaldo 455, room 1103, Sao Paulo, SP, 01246-903, Brazil
- Department of Legal Medicine, Bioethics, Occupational Medicine and Physical Medicine and Rehabilitation, Medical School, University of Sao Paulo, Av. Dr. Arnaldo 455, room 1103, Sao Paulo, 01246-903, Brazil
- Jose Oliveira Siqueira
- Department of Legal Medicine, Bioethics, Occupational Medicine and Physical Medicine and Rehabilitation, Medical School, University of Sao Paulo, Av. Dr. Arnaldo 455, room 1103, Sao Paulo, 01246-903, Brazil.
3
Cai L, Chung SW, Lee T. Incremental Model Fit Assessment in the Case of Categorical Data: Tucker-Lewis Index for Item Response Theory Modeling. Prev Sci 2021; 24:455-466. [PMID: 33970410 PMCID: PMC10115722 DOI: 10.1007/s11121-021-01253-4] [Accepted: 04/26/2021] [Indexed: 11/25/2022]
Abstract
The Tucker-Lewis index (TLI; Tucker & Lewis, 1973), also known as the non-normed fit index (NNFI; Bentler & Bonett, 1980), is one of the numerous incremental fit indices widely used in linear mean and covariance structure modeling, particularly in exploratory factor analysis, tools popular in prevention research. It augments information provided by other indices such as the root-mean-square error of approximation (RMSEA). In this paper, we develop and examine an analogous index for categorical item-level data modeled with item response theory (IRT). The proposed Tucker-Lewis index for IRT (TLIRT) is based on Maydeu-Olivares and Joe's (2005) [Formula: see text] family of limited-information overall model fit statistics. The limited-information fit statistics have significantly better chi-square approximation and power than traditional full-information Pearson or likelihood ratio statistics under realistic situations. Building on the incremental fit assessment principle, the TLIRT compares the fit of the model under consideration along a spectrum of worst to best possible model fit scenarios. We examine the performance of the new index using simulated and empirical data. Results from a simulation study suggest that the new index behaves as theoretically expected, and it can offer additional insights about model fit not available from other sources. In addition, a more stringent cutoff value is perhaps needed than Hu and Bentler's (1999) traditional cutoff criterion with continuous variables. In the empirical data analysis, we use a data set from a measurement development project in support of cigarette smoking cessation research to illustrate the usefulness of the TLIRT. We noticed that had we only utilized the RMSEA index, we could have arrived at qualitatively different conclusions about model fit, depending on the choice of test statistics, an issue to which the TLIRT is relatively more immune.
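The incremental fit principle mentioned above can be illustrated with the classical Tucker-Lewis index, which places the fitted model between a baseline (worst-case) model and a perfectly fitting one. The Python sketch below implements only that classical formula; the abstract states that the TLIRT is built on limited-information statistics instead, and its exact form is not given here, so the function name and argument names are illustrative assumptions.

```python
def tucker_lewis_index(chi2_null, df_null, chi2_model, df_model):
    """Classical TLI/NNFI from chi-square statistics (illustrative sketch).

    chi2_null, df_null:   statistic and degrees of freedom of the baseline model
    chi2_model, df_model: statistic and degrees of freedom of the fitted model
    """
    if df_null <= 0 or df_model <= 0:
        raise ValueError("degrees of freedom must be positive")

    ratio_null = chi2_null / df_null      # misfit per degree of freedom, baseline
    ratio_model = chi2_model / df_model   # misfit per degree of freedom, fitted model

    if ratio_null == 1:
        raise ValueError("TLI is undefined when the baseline ratio equals 1")

    # 0 means no better than the baseline; 1 means the ratio expected
    # under a correctly specified model (chi-square / df = 1).
    return (ratio_null - ratio_model) / (ratio_null - 1)


if __name__ == "__main__":
    # Made-up statistics, used only to show the call.
    print(tucker_lewis_index(chi2_null=1000.0, df_null=45,
                             chi2_model=60.0, df_model=35))
```

Because the index is not normed, values slightly above 1 or below 0 can occur, which is why the alternative name is the non-normed fit index.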
Affiliation(s)
- Li Cai
- University of California, UCLA/CRESST, 315 GSEIS Bldg, Los Angeles, 90095-1522, CA, USA.
4
Abstract
Current computations of commonly used fit indices in structural equation modeling (SEM), such as RMSEA and CFI, indicate much better fit when the data are categorical than if the same data had not been categorized. As a result, researchers may be led to accept poorly fitting models with greater frequency when data are categorical. In this article, I first explain why the current computations of categorical fit indices lead to this problematic behavior. I then propose and evaluate alternative ways to compute fit indices with categorical data. The proposed computations approximate what the fit index values would have been had the data not been categorized. The developments in this article are for the DWLS (diagonally weighted least squares) estimator, a popular limited information categorical estimation method. I report on the results of a simulation comparing existing and newly proposed categorical fit indices. The results confirmed the theoretical expectation that the new indices better match the corresponding values with continuous data. The new fit indices performed well across all studied conditions, with the exception of binary data at the smallest studied sample size (N = 200), when all categorical fit indices performed poorly.
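For reference, the two indices named in this abstract are conventionally computed from the model and baseline chi-square statistics as sketched below in Python. This shows only the standard textbook formulas, not the alternative categorical computations the article proposes; the function names are illustrative, and the use of N - 1 in the RMSEA denominator is one common convention (some software uses N).

```python
import math


def rmsea(chi2_model, df_model, n_obs):
    """Conventional RMSEA point estimate (illustrative sketch)."""
    if df_model <= 0 or n_obs <= 1:
        raise ValueError("need df_model > 0 and n_obs > 1")
    # Excess misfit beyond the degrees of freedom, truncated at zero.
    excess = max(chi2_model - df_model, 0.0)
    return math.sqrt(excess / (df_model * (n_obs - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Conventional comparative fit index (illustrative sketch)."""
    excess_model = max(chi2_model - df_model, 0.0)
    excess_worst = max(chi2_model - df_model, chi2_null - df_null, 0.0)
    if excess_worst == 0.0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - excess_model / excess_worst


if __name__ == "__main__":
    # Made-up statistics, used only to show the calls.
    print(rmsea(chi2_model=60.0, df_model=35, n_obs=200))
    print(cfi(chi2_model=60.0, df_model=35, chi2_null=1000.0, df_null=45))
```

Both indices depend on the data only through the chi-square statistics, so anything that shrinks those statistics, such as categorization under the estimator discussed above, makes the fit look better.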
Affiliation(s)
- Victoria Savalei
- Department of Psychology, University of British Columbia, Vancouver, Canada
5
Meyerson W, Leisman J, Navarro FCP, Gerstein M. Origins and characterization of variants shared between databases of somatic and germline human mutations. BMC Bioinformatics 2020; 21:227. [PMID: 32498674 PMCID: PMC7273669 DOI: 10.1186/s12859-020-3508-8] [Received: 02/24/2020] [Accepted: 04/20/2020] [Indexed: 01/26/2023]
Abstract
Background: Mutations arise in the human genome in two major settings: the germline and the soma. These settings involve different inheritance patterns, time scales, chromatin structures, and environmental exposures, all of which impact the resulting distribution of substitutions. Nonetheless, many of the same single nucleotide variants (SNVs) are shared between germline and somatic mutation databases, such as between the gnomAD database of 120,000 germline exomes and the TCGA database of 10,000 somatic exomes. Here, we sought to explain this overlap. Results: After strict filtering to exclude common germline polymorphisms and sites with poor coverage or mappability, we found 336,987 variants shared between the somatic and germline databases. A uniform statistical model explains 34% of these shared variants; a model that incorporates the varying mutation rates of the basic mutation types explains another 50% of shared variants; and a model that includes extended nucleotide contexts (e.g., the surrounding 3 bases on either side) explains an additional 4% of shared variants. Analysis of read depth finds mixed evidence that up to 4% of the shared variants may represent germline variants leaked into somatic call sets. 9% of the shared variants are not explained by any model. Sequencing errors and convergent evolution did not account for these. We surveyed other factors as well: cancers driven by endogenous mutational processes share a greater fraction of variants with the germline, and recently derived germline variants were more likely to be somatically shared than were ancient germline ones. Conclusions: Overall, we find that shared variants largely represent bona fide biological occurrences of the same variant in the germline and somatic setting and arise primarily because DNA has some of the same basic chemical vulnerabilities in either setting. Moreover, we find mixed evidence that somatic call sets leak appreciable numbers of germline variants, which is relevant to genomic privacy regulations. In future studies, the similar chemical vulnerability of DNA between the somatic and germline settings might be used to help identify disease-related genes by guiding the development of background-mutation models that are informed by both somatic and germline patterns of variation.
Affiliation(s)
- William Meyerson
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- John Leisman
- Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06510, USA
- Fabio C P Navarro
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA
- Mark Gerstein
- Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA
- Department of Computer Science, Yale University, New Haven, CT, 06511, USA
6
Dai X, Fu G, Reese R. Detecting PCOS susceptibility loci from genome-wide association studies via iterative trend correlation based feature screening. BMC Bioinformatics 2020; 21:177. [PMID: 32366216 PMCID: PMC7199379 DOI: 10.1186/s12859-020-3492-z] [Received: 11/03/2019] [Accepted: 04/13/2020] [Indexed: 01/18/2023]
Abstract
Background: Feature screening plays a critical role in ultrahigh-dimensional data analysis, where the number of features grows exponentially with the number of observations. It is increasingly common in biomedical research to have a case-control (binary) response and an extremely large number of categorical features. However, approaches for such data types are limited in the extant literature. In this article, we propose a new feature screening approach based on the iterative trend correlation (ITC-SIS, for short) to detect important susceptibility loci that are associated with polycystic ovary syndrome (PCOS) affection status by screening 731,442 SNP features collected from genome-wide association studies. Results: We prove that the trend correlation based screening approach satisfies the strong screening consistency property under a set of reasonable conditions, which provides theoretical support for its performance. We demonstrate through various simulation designs that ITC-SIS is accurate and fast in finite samples. Conclusion: ITC-SIS serves as a good alternative method for detecting disease susceptibility loci in clinical genomic data.
Affiliation(s)
- Xiaotian Dai
- Department of Mathematical Sciences, SUNY Binghamton University, New York, USA
- Guifang Fu
- Department of Mathematical Sciences, SUNY Binghamton University, New York, USA.
7
Koffer RE, Ram N, Almeida DM. More than Counting: An Intraindividual Variability Approach to Categorical Repeated Measures. J Gerontol B Psychol Sci Soc Sci 2017; 73:87-99. [PMID: 29029333 PMCID: PMC5927081 DOI: 10.1093/geronb/gbx086] [Received: 07/11/2016] [Accepted: 06/02/2017] [Indexed: 11/13/2022]
Abstract
Objectives: Age-related differences in daily experiences are often described using summaries of categorical repeated measures, including typologies of stressors, activities, social partners, and coping strategies. This paper illustrates how an intraindividual variability (IIV) framework can be used to extract additional meaning from categorical IIV data. Method: Using 8-occasion categorical data on daily stressors from the National Study of Daily Experiences (N = 1,499, M_age = 46.74, SD_age = 12.91), we derive and compute six IIV metrics that invoke numeric and nominal measurement of the central tendency, dispersion, and asymmetry of individuals' stressor experiences, and we examine how these metrics (relative dominance, diversity, log-skew and mode, spread, order) are related to age and interindividual differences in negative affect. Results: Results demonstrate the utility of the numeric and nominal categorical IIV metrics, with theoretically meaningful age gradients in the three numeric IIV stressor metrics and five of six IIV metrics mapping differences in negative affect. Discussion: Findings highlight how the unique constructs measured by these six metrics of categorical IIV may be used to examine dynamic processes, study interindividual and age-related differences, and expand the variety of developmental research questions that may be answered using categorical repeated measures data.
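As a rough illustration of what person-level summaries of categorical repeated measures can look like, the Python sketch below computes the modal category and a Shannon-entropy diversity score for one person's sequence of daily reports. The abstract does not define its six metrics, so the entropy-based diversity, the function names, and the example categories are all assumptions made for illustration rather than the authors' definitions.

```python
import math
from collections import Counter


def modal_category(sequence):
    """Most frequent category in one person's sequence (ties: first seen wins)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return Counter(sequence).most_common(1)[0][0]


def shannon_diversity(sequence):
    """Shannon entropy (natural log) of the category proportions.

    Assumed here as one common way to quantify diversity; 0 means the
    person reported the same category on every occasion.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    proportions = [count / n for count in Counter(sequence).values()]
    return -sum(p * math.log(p) for p in proportions)


if __name__ == "__main__":
    # Hypothetical 8-occasion record of daily stressor types for one person.
    days = ["none", "work", "none", "home", "none", "work", "none", "none"]
    print(modal_category(days))
    print(shannon_diversity(days))
```

Computing such summaries per person yields one value per individual, which can then be related to age or negative affect in an ordinary between-person analysis.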
Affiliation(s)
- Rachel E Koffer
- Department of Human Development & Family Studies, The Pennsylvania State University, University Park
- Nilam Ram
- Department of Human Development & Family Studies, The Pennsylvania State University, University Park
- German Institute for Economic Research (DIW), Berlin
- David M Almeida
- Department of Human Development & Family Studies, The Pennsylvania State University, University Park