1
Chen Y, Li C, Ouyang J, Xu G. DIF Statistical Inference Without Knowing Anchoring Items. Psychometrika 2023; 88:1097-1122. [PMID: 37550561] [PMCID: PMC10656337] [DOI: 10.1007/s11336-023-09930-9]
Abstract
Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step in establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free in order to draw inferences on whether each of the remaining items is a DIF item, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, or some anchor items are misspecified, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and p-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal ℓ1 norm condition for identifying the latent trait distributions. Without requiring prior knowledge about an anchor set, it can accurately estimate the DIF effects of individual items and further draw valid statistical inferences for quantifying the uncertainty. Specifically, the inference results allow us to control the type-I error for DIF detection, which may not be possible with item purification and regularized estimation methods. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is applied to analysing the three personality scales of the Eysenck Personality Questionnaire-Revised (EPQ-R).
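The minimal-norm identification idea can be illustrated with a toy calculation. The Python sketch below is a simplified intercept-shift setup, not the authors' MIMIC estimator, and the flagging cutoff of 0.3 is an arbitrary illustration: minimizing the ℓ1 norm of the item-wise DIF effects amounts to taking the median of the item-wise group shifts, which stays close to the true latent mean difference even when a few items have DIF, whereas the plain mean does not.

```python
# A toy numerical sketch of the minimal-L1-norm identification idea (not the
# authors' MIMIC estimator): the latent mean difference mu and the item-wise
# DIF effects gamma_j are only identified jointly, and the L1 condition picks
# the decomposition with the smallest sum of |gamma_j|.
import numpy as np

rng = np.random.default_rng(1)
J, mu_true = 25, 0.5                      # number of items, true mean difference
gamma_true = np.zeros(J)
gamma_true[:5] = 0.8                      # 5 DIF items favor one group, rest DIF-free

# d_j: observed shift of item j between the two groups = mu + gamma_j (+ noise)
d = mu_true + gamma_true + rng.normal(0.0, 0.05, J)

# argmin_mu sum_j |d_j - mu| is the median of the item-wise shifts.
mu_hat = np.median(d)
gamma_hat = d - mu_hat

print(f"mu_hat (minimal L1 norm) = {mu_hat:.3f}")    # close to 0.5
print(f"mu_hat (plain mean)      = {d.mean():.3f}")  # pulled away by the DIF items
print("flagged DIF items:", np.where(np.abs(gamma_hat) > 0.3)[0])  # arbitrary cutoff
```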
Affiliation(s)
- Yunxiao Chen
- London School of Economics and Political Science, London, UK.
2
Lai MHC. Adjusting for Measurement Noninvariance with Alignment in Growth Modeling. Multivariate Behavioral Research 2023; 58:30-47. [PMID: 34236919] [DOI: 10.1080/00273171.2021.1941730]
Abstract
Longitudinal measurement invariance, the consistency of measurement in data collected over time, is a prerequisite for any meaningful inference about growth patterns. When one or more items measuring the construct of interest show noninvariant measurement properties over time, growth parameter estimates and the inferences based on them are biased. In this paper, I extend the recently developed alignment-within-confirmatory factor analysis (AwC) technique to adjust for measurement biases in growth models. The proposed AwC method requires neither a priori knowledge of noninvariant items nor the iterative search for noninvariant items typical of longitudinal measurement invariance research. Results of a Monte Carlo simulation study comparing AwC with the partial invariance modeling method show that AwC largely reduces biases in growth parameter estimates and gives good control of Type I error rates, especially when the sample size is at least 1,000. It also outperforms the partial invariance method in conditions where all items are noninvariant. However, all methods give biased growth parameter estimates when the proportion of noninvariant parameters exceeds 25%. Based on the simulation results, I conclude that AwC is a viable alternative to the partial invariance method in growth modeling when it is not clear whether longitudinal measurement invariance holds. The current paper also demonstrates AwC in an example modeling neuroticism over three time points using a public data set, showing how researchers can compute effect size indices for noninvariance in AwC to assess to what degree invariance holds and whether AwC results are trustworthy.
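As a rough illustration of what alignment optimization does, the Python sketch below uses an assumed parameterization in the spirit of Asparouhov and Muthén's alignment method, not the AwC implementation, and the configural loadings and intercepts are made-up numbers: it searches for factor means and variances at later time points that make as many loading and intercept differences over time as possible close to zero.

```python
# A rough sketch of alignment optimization (assumed parameterization, not the
# paper's implementation): configural loadings/intercepts are transformed by
# candidate factor means alpha_t and variances psi_t, and a "simplicity" loss
# rewards solutions in which most loading/intercept differences over time are
# close to zero.
import numpy as np
from scipy.optimize import minimize

def component_loss(x, eps=0.01):
    # Approximate L0-like penalty used in alignment: small when x is near 0.
    return np.sqrt(np.sqrt(x ** 2 + eps))

def total_loss(free, lam0, nu0):
    T = lam0.shape[1]                                     # number of time points
    alpha = np.concatenate([[0.0], free[:T - 1]])         # first time point fixed
    psi = np.concatenate([[1.0], np.exp(free[T - 1:])])   # variances kept positive
    lam = lam0 / np.sqrt(psi)                             # aligned loadings
    nu = nu0 - alpha * lam                                # aligned intercepts
    loss = 0.0
    for t in range(T):
        for u in range(t + 1, T):
            loss += component_loss(lam[:, t] - lam[:, u]).sum()
            loss += component_loss(nu[:, t] - nu[:, u]).sum()
    return loss

# Hypothetical configural estimates: 4 items, 3 time points, an intercept drift
# for item 4 at time 2; the generating factor means are roughly 0, 0.3, 0.5.
lam0 = np.array([[0.80, 0.82, 0.79],
                 [0.70, 0.71, 0.69],
                 [0.90, 0.88, 0.91],
                 [0.60, 0.61, 0.60]])
nu0 = np.array([[0.00, 0.24, 0.40],
                [0.00, 0.21, 0.35],
                [0.00, 0.27, 0.45],
                [0.00, 0.48, 0.30]])

res = minimize(total_loss, x0=np.zeros(4), args=(lam0, nu0), method="Nelder-Mead")
T = lam0.shape[1]
means = np.concatenate([[0.0], res.x[:T - 1]])
print("estimated factor means:", np.round(means, 2))  # should be near 0, 0.3, 0.5
```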
Affiliation(s)
- Mark H C Lai
- Department of Psychology, University of Southern California
3
Robitzsch A. Comparing the Robustness of the Structural after Measurement (SAM) Approach to Structural Equation Modeling (SEM) against Local Model Misspecifications with Alternative Estimation Approaches. Stats 2022. [DOI: 10.3390/stats5030039]
Abstract
Structural equation models (SEM), or confirmatory factor analysis as a special case, contain model parameters in the measurement part and the structural part. In most social-science SEM applications, all parameters are simultaneously estimated in a one-step approach (e.g., with maximum likelihood estimation). In a recent article, Rosseel and Loh (2022, Psychol. Methods) proposed a two-step structural after measurement (SAM) approach to SEM that estimates the parameters of the measurement model in the first step and the parameters of the structural model in the second step. Rosseel and Loh claimed that SAM is more robust to local model misspecifications (i.e., cross loadings and residual correlations) than one-step maximum likelihood estimation. In this article, it is demonstrated with analytical derivations and simulation studies that SAM is generally not more robust to misspecifications than one-step estimation approaches. Alternative estimation methods are proposed that provide more robustness to misspecifications. SAM suffers from finite-sample bias that depends on the size of factor reliability and factor correlations. A bootstrap-bias-corrected LSAM estimate provides less biased estimates in finite samples. Nevertheless, we argue in the discussion section that applied researchers should adopt SAM, because robustness to local misspecifications is an irrelevant property when applying SAM. Parameter estimates in a structural model are of interest because intentionally misspecified SEMs frequently offer clearly interpretable factors. In contrast, SEMs with some empirically driven model modifications will result in biased estimates of the structural parameters because the meaning of factors is unintentionally changed.
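The two-step logic can be sketched numerically. The Python sketch below uses a hypothetical two-factor model with three indicators per factor, simple three-indicator identities in place of a full step-1 estimator, and a ULS-type mapping matrix, so it is not Rosseel and Loh's implementation: the measurement part is estimated block by block, and the structural coefficient is then computed from the implied factor covariance matrix.

```python
# A simplified numerical sketch of "structural after measurement" (assumed
# setup, not Rosseel & Loh's implementation): step 1 estimates the measurement
# model per factor block; step 2 maps the indicator covariance matrix to an
# implied factor covariance matrix and computes the structural coefficient.
import numpy as np

rng = np.random.default_rng(7)
n, beta = 5000, 0.5                        # sample size, true structural effect
lam = np.array([0.8, 0.7, 0.6] * 2)        # loadings, 3 indicators per factor

eta1 = rng.normal(size=n)
eta2 = beta * eta1 + rng.normal(scale=np.sqrt(1 - beta ** 2), size=n)
eta = np.column_stack([eta1, eta2])
Lambda = np.zeros((6, 2)); Lambda[:3, 0] = lam[:3]; Lambda[3:, 1] = lam[3:]
Y = eta @ Lambda.T + rng.normal(scale=0.6, size=(n, 6))
S = np.cov(Y, rowvar=False)

# Step 1: measurement model per block (3-indicator identities, factor variance 1).
def block_loadings(Sb):
    l1 = np.sqrt(Sb[0, 1] * Sb[0, 2] / Sb[1, 2])
    l2 = np.sqrt(Sb[0, 1] * Sb[1, 2] / Sb[0, 2])
    l3 = np.sqrt(Sb[0, 2] * Sb[1, 2] / Sb[0, 1])
    return np.array([l1, l2, l3])

L_hat = np.zeros((6, 2))
L_hat[:3, 0] = block_loadings(S[:3, :3])
L_hat[3:, 1] = block_loadings(S[3:, 3:])
Theta_hat = np.diag(np.diag(S) - (L_hat ** 2).sum(axis=1))

# Step 2: implied factor covariance via a (ULS-type) mapping matrix, then the
# structural regression coefficient of eta2 on eta1.
M = np.linalg.solve(L_hat.T @ L_hat, L_hat.T)
Phi_hat = M @ (S - Theta_hat) @ M.T
beta_hat = Phi_hat[0, 1] / Phi_hat[0, 0]
print(f"beta_hat = {beta_hat:.3f}  (true value {beta})")
```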
4
Robitzsch A. On the Choice of the Item Response Model for Scaling PISA Data: Model Selection Based on Information Criteria and Quantifying Model Uncertainty. Entropy (Basel, Switzerland) 2022; 24:760. [PMID: 35741481] [PMCID: PMC9223051] [DOI: 10.3390/e24060760]
Abstract
In educational large-scale assessment studies such as PISA, item response theory (IRT) models are used to summarize students' performance on cognitive test items across countries. In this article, the impact of the choice of the IRT model on the distribution parameters of countries (i.e., mean, standard deviation, percentiles) is investigated. Eleven different IRT models are compared using information criteria. Moreover, model uncertainty is quantified by estimating a model error, which can be compared with the sampling error associated with the sampling of students. The PISA 2009 dataset for the cognitive domains mathematics, reading, and science is used to illustrate the consequences of the choice of the IRT model. It turned out that the three-parameter logistic IRT model with residual heterogeneity and a three-parameter IRT model with a quadratic effect of the ability θ provided the best model fit. Furthermore, model uncertainty was relatively small compared to sampling error regarding country means in most cases but was substantial for country standard deviations and percentiles. Consequently, it can be argued that model error should be included in the statistical inference of educational large-scale assessment studies.
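The comparison of model uncertainty and sampling uncertainty can be sketched with a toy calculation. In the Python sketch below, the country means, the sampling standard error, and the specific formulas for the model error and the combined error are hypothetical illustrations, not the article's exact definitions.

```python
# A toy calculation (hypothetical numbers and assumed formulas): a country mean
# is estimated under several IRT models, the model error is taken as the spread
# of those estimates, and it is combined with the sampling error of the mean.
import numpy as np

mu_by_model = np.array([498.2, 499.1, 497.8, 500.4, 498.9])  # means under 5 models
se_sampling = 2.1                                            # sampling standard error

model_error = mu_by_model.std(ddof=1)            # spread of estimates across models
total_error = np.sqrt(se_sampling ** 2 + model_error ** 2)

print(f"model error    = {model_error:.2f}")
print(f"sampling error = {se_sampling:.2f}")
print(f"total error    = {total_error:.2f}")
```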
Affiliation(s)
- Alexander Robitzsch
- IPN—Leibniz Institute for Science and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany;
- Centre for International Student Assessment (ZIB), Olshausenstraße 62, 24118 Kiel, Germany
5
Robitzsch A. Estimation Methods of the Multiple-Group One-Dimensional Factor Model: Implied Identification Constraints in the Violation of Measurement Invariance. Axioms 2022. [DOI: 10.3390/axioms11030119]
Abstract
Factor analysis is one of the most important statistical tools for analyzing multivariate data (i.e., items) in the social sciences. An essential case is the comparison of multiple groups on a one-dimensional factor variable that can be interpreted as a summary of the items. Measurement invariance is a frequently employed assumption that enables the comparison of the factor variable across groups. This article discusses different estimation methods of the multiple-group one-dimensional factor model under violations of measurement invariance (i.e., measurement noninvariance). In detail, joint estimation, linking methods, and regularized estimation approaches are treated. It is argued that linking approaches and regularization approaches can be equivalent to joint estimation approaches if appropriate (robust) loss functions are employed. Each of the estimation approaches defines identification constraints on the parameters that quantify violations of measurement invariance. We argue in the discussion section that the fitted multiple-group one-dimensional factor analysis will likely be misspecified due to the violation of measurement invariance. Hence, because there is always indeterminacy in determining group comparisons of the factor variable under noninvariance, the preference for particular fitting strategies, such as partial invariance, over alternatives is unjustified. In contrast, researchers purposely define, through the choice of a particular (robust) loss function, fitting functions that minimize the extent of model misspecification.
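How the choice of loss function acts as an identification constraint can be sketched numerically. In the Python sketch below, the item intercept differences and the power-loss family are hypothetical illustrations of robust linking, not the article's estimators.

```python
# A numerical sketch (hypothetical numbers) of how the loss function in a
# linking approach defines the identification constraint under noninvariance:
# the group mean difference mu minimizes sum_j |d_j - mu|^p, where the d_j are
# item-wise intercept differences and p controls the robustness of the loss.
import numpy as np

rng = np.random.default_rng(3)
d = np.concatenate([rng.normal(0.2, 0.05, 8), [1.2, 1.1]])  # 2 noninvariant items

grid = np.linspace(-2.0, 2.0, 4001)          # simple grid search over mu

def link(p):
    losses = (np.abs(d[None, :] - grid[:, None]) ** p).sum(axis=1)
    return grid[losses.argmin()]

for p in (2.0, 1.0, 0.5):
    print(f"p = {p}: estimated group difference = {link(p):+.3f}")
# p = 2 reproduces the mean (pulled toward the two noninvariant items), p = 1
# gives the median, and smaller p gives an even more robust, mode-like
# identification constraint close to the invariant items.
```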
6
Robitzsch A. On the Treatment of Missing Item Responses in Educational Large-Scale Assessment Data: An Illustrative Simulation Study and a Case Study Using PISA 2018 Mathematics Data. Eur J Investig Health Psychol Educ 2021; 11:1653-1687. [PMID: 34940395] [PMCID: PMC8700118] [DOI: 10.3390/ejihpe11040117]
Abstract
Missing item responses are prevalent in educational large-scale assessment studies such as the programme for international student assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated for a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item response itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. In an illustrative simulation study, it is shown that the Mislevy-Wu model provides unbiased model parameter estimates. Moreover, the simulation replicates the finding from various simulation studies in the literature that scoring missing item responses as wrong provides biased estimates if the latent ignorability assumption holds in the data-generating model. However, if missing item responses can only arise from incorrect item responses, applying an item response model that relies on latent ignorability results in biased estimates. The Mislevy-Wu model, in contrast, guarantees unbiased parameter estimates whenever this more general model holds in the data-generating process. In addition, this article uses the PISA 2018 mathematics dataset as a case study to investigate the consequences of different missing data treatments on country means and country standard deviations. The obtained country means and country standard deviations can differ substantially between the scaling models. In contrast to previous statements in the literature, the scoring of missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item response itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, in the discussion section, we argue that model fit should only play a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
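The key distinction can be sketched schematically. In the Python sketch below, the logistic parameterization, the sign conventions, and the parameter values are assumed for illustration, not taken from the article: latent ignorability corresponds to a response-indicator model in which omission does not depend on the unobserved response (delta = 0), whereas scoring omissions as wrong corresponds to the limit in which a correct response is never omitted.

```python
# A schematic sketch of the Mislevy-Wu idea (parameterization and parameter
# values assumed for illustration): the probability of omitting an item may
# depend on the unobserved response Y itself via a parameter delta; delta = 0
# gives latent ignorability, and a very negative delta reproduces the scoring
# of omitted responses as wrong.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_correct(theta, b=0.0):
    return sigmoid(theta - b)                  # Rasch model for the item response

def p_missing_given_y(y, xi, beta=0.0, delta=0.0):
    return sigmoid(beta - xi + delta * y)      # response-indicator model

def p_correct_given_missing(theta, xi, delta):
    # Marginalize the missingness probability over the unobserved response Y.
    py1 = p_correct(theta)
    p_miss = (1 - py1) * p_missing_given_y(0, xi, delta=delta) \
             + py1 * p_missing_given_y(1, xi, delta=delta)
    return py1 * p_missing_given_y(1, xi, delta=delta) / p_miss

theta, xi = 0.5, -0.5                          # latent ability and response propensity
for delta in (0.0, -2.0, -30.0):               # ignorable, nonignorable, ~"score as wrong"
    print(f"delta = {delta:6.1f}: P(correct | omitted) = "
          f"{p_correct_given_missing(theta, xi, delta):.3f}")
```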
Affiliation(s)
- Alexander Robitzsch
- IPN—Leibniz Institute for Science and Mathematics Education, University of Kiel, Olshausenstraße 62, 24118 Kiel, Germany;
- Centre for International Student Assessment (ZIB), University of Kiel, Olshausenstraße 62, 24118 Kiel, Germany
7
Robitzsch A. Robust and Nonrobust Linking of Two Groups for the Rasch Model with Balanced and Unbalanced Random DIF: A Comparative Simulation Study and the Simultaneous Assessment of Standard Errors and Linking Errors with Resampling Techniques. Symmetry (Basel) 2021. [DOI: 10.3390/sym13112198]
Abstract
In this article, the Rasch model is used for assessing a mean difference between two groups on a test of dichotomous items. It is assumed that random differential item functioning (DIF) exists that can bias group differences. The case of balanced DIF is distinguished from the case of unbalanced DIF. In balanced DIF, DIF effects cancel out on average. In contrast, in unbalanced DIF, the expected value of the DIF effects can differ from zero and on average favor a particular group. Robust linking methods (e.g., invariance alignment) aim at determining group mean differences that are robust to the presence of DIF. In contrast, group differences obtained from nonrobust linking methods (e.g., Haebara linking) can be affected by the presence of a few DIF effects. Alternative robust and nonrobust linking methods are compared in a simulation study under various simulation conditions. It turned out that robust linking methods are preferred over nonrobust alternatives in the case of unbalanced DIF effects. Moreover, the theory of M-estimation, as an important approach to robust statistical estimation suitable for data with asymmetric errors, is used to study the asymptotic behavior of linking estimators as the number of items tends to infinity. These results give insights into the asymptotic bias and into the estimation of linking errors, which represent the variability in estimates due to the selection of items in a test. Moreover, M-estimation is used in an analytical treatment to assess standard errors and linking errors simultaneously. Finally, double jackknife and double half sampling methods are introduced and evaluated in a simulation study for the simultaneous assessment of standard errors and linking errors. Half sampling outperformed jackknife estimators in assessing the variability of estimates from robust linking methods.
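The contrast between balanced and unbalanced DIF, and between nonrobust and robust linking, can be sketched with a small Monte Carlo experiment. In the Python sketch below, the DIF distributions, the number of items, and the use of the mean versus the median of item-wise difficulty differences are simplifications chosen for illustration only, not the article's design.

```python
# A small Monte Carlo sketch (made-up parameter values) contrasting a nonrobust
# linking estimator (mean of item-wise difficulty differences) with a robust
# one (median) under balanced and unbalanced random DIF.
import numpy as np

rng = np.random.default_rng(11)
J, mu_true, R = 20, 0.3, 2000               # items, true mean difference, replications

def simulate(unbalanced):
    dif = rng.normal(0.0, 0.25, size=(R, J))                     # balanced random DIF
    if unbalanced:
        dif[:, :4] = np.abs(rng.normal(0.0, 0.8, size=(R, 4)))   # 4 items favor one group
    d = mu_true + dif                       # item-wise estimates of the group difference
    return d.mean(axis=1), np.median(d, axis=1)

for label, unbalanced in (("balanced DIF", False), ("unbalanced DIF", True)):
    mean_est, median_est = simulate(unbalanced)
    print(f"{label:15s} bias(mean) = {mean_est.mean() - mu_true:+.3f}   "
          f"bias(median) = {median_est.mean() - mu_true:+.3f}")
# Under balanced DIF both estimators are essentially unbiased; under unbalanced
# DIF the nonrobust mean is biased, while the median stays much closer to the
# true group difference.
```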
8
Arts I, Fang Q, van de Schoot R, Meitinger K. Approximate Measurement Invariance of Willingness to Sacrifice for the Environment Across 30 Countries: The Importance of Prior Distributions and Their Visualization. Front Psychol 2021; 12:624032. [PMID: 34366953] [PMCID: PMC8341077] [DOI: 10.3389/fpsyg.2021.624032]
Abstract
Nationwide opinions and international attitudes toward climate and environmental change are receiving increasing attention in both scientific and political communities. An often-used way to measure these attitudes is through large-scale social surveys. However, measurement invariance, the assumption required for a valid country comparison, is often not met, especially when a large number of countries are being compared. This makes a ranking of countries by the mean of a latent variable potentially unstable and may lead to untrustworthy conclusions. Recently, more liberal approaches to assessing measurement invariance have been proposed, such as the alignment method in combination with Bayesian approximate measurement invariance. However, the effect of prior variances on the assessment procedure and the substantive conclusions is often not well understood. In this article, we tested for measurement invariance of the latent variable "willingness to sacrifice for the environment" using maximum likelihood multigroup confirmatory factor analysis and Bayesian approximate measurement invariance, both with and without alignment optimization. For the Bayesian models, we used multiple priors to assess the impact on the rank-order stability of countries. The results are visualized in such a way that the effect of different prior variances and models on group means and rankings becomes clear. We show that even when models appear to fit the data well, there might still be an unwanted impact on the rank ordering of countries. From the results, we can conclude that people in Switzerland and South Korea are the most motivated to sacrifice for the environment, while people in Latvia are the least motivated to do so.
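Why the prior variance matters can be seen from a schematic conjugate-normal calculation. In the Python sketch below, the numbers and the simple normal-normal setup are illustrative assumptions, not the article's multigroup model: the smaller the prior variance on an item-parameter deviation, the more strongly an observed deviation is shrunk toward exact invariance, which in turn affects the estimated group means and rankings.

```python
# A schematic conjugate-normal calculation (illustrative numbers): with a
# zero-mean prior of variance v on an item-parameter deviation and sampling
# variance s2 of the observed deviation d, the posterior mean is d * v / (v + s2).
d, s2 = 0.30, 0.02     # observed intercept deviation and its sampling variance
for v in (0.001, 0.01, 0.05, 0.5):
    shrink = v / (v + s2)
    print(f"prior variance v = {v:<5}: posterior mean of the deviation = {shrink * d:.3f}")
# A very small prior variance forces near-exact invariance (deviations shrunk
# toward 0); a larger prior variance lets noninvariant items keep their
# deviations, which shifts the estimated group means and can reorder countries.
```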
Affiliation(s)
- Ingrid Arts
- Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, Netherlands
- Qixiang Fang
- Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, Netherlands
- Rens van de Schoot
- Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, Netherlands
- Katharina Meitinger
- Department of Methodology and Statistics, Faculty of Social Sciences, Utrecht University, Utrecht, Netherlands