1
Thissen D. A Review of Some of the History of Factorial Invariance and Differential Item Functioning. MULTIVARIATE BEHAVIORAL RESEARCH 2024:1-25. [PMID: 39264323] [DOI: 10.1080/00273171.2024.2396148]
Abstract
The concept of factorial invariance has evolved since it originated in the 1930s as a criterion for the usefulness of the multiple factor model; it has become a form of analysis supporting the validity of inferences about group differences on underlying latent variables. The analysis of differential item functioning (DIF) arose in the literature of item response theory (IRT), where its original purpose was the detection and removal of test items that are differentially difficult for, or biased against, one subpopulation or another. The two traditions merge at the level of the underlying latent variable model, but their separate origins and different purposes have led them to differ in details of terminology and procedure. This review traces some aspects of the histories of the two traditions, ultimately drawing some conclusions about how analysts may draw on elements of both, and how the nature of the research question determines the procedures used. Whether statistical tests are grouped by parameter (as in studies of factorial invariance) or across parameters by variable (as in DIF analysis) depends on the context and is independent of the model, as are subtle aspects of the order of the tests. In any case in which DIF or partial invariance is a possibility, the invariant parameters, or anchor items in DIF analysis, are best selected in an interplay between the statistics and judgment about what is being measured.
Affiliation(s)
- David Thissen
- Department of Psychology and Neuroscience, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
2
Halpin PF. Differential Item Functioning via Robust Scaling. PSYCHOMETRIKA 2024; 89:796-821. [PMID: 38704430] [DOI: 10.1007/s11336-024-09957-6]
Abstract
This paper proposes a method for assessing differential item functioning (DIF) in item response theory (IRT) models. The method does not require pre-specification of anchor items, which is its main virtue. It is developed in two main steps: first by showing how DIF can be re-formulated as a problem of outlier detection in IRT-based scaling and then tackling the latter using methods from robust statistics. The proposal is a redescending M-estimator of IRT scaling parameters that is tuned to flag items with DIF at the desired asymptotic type I error rate. Theoretical results describe the efficiency of the estimator in the absence of DIF and its robustness in the presence of DIF. Simulation studies show that the proposed method compares favorably to currently available approaches for DIF detection, and a real data example illustrates its application in a research context where pre-specification of anchor items is infeasible. The focus of the paper is the two-parameter logistic model in two independent groups, with extensions to other settings considered in the conclusion.
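As a rough sketch of the robust-scaling idea (not Halpin's actual estimator, which operates on IRT scaling parameters): a redescending M-estimator such as the Tukey biweight gives zero weight to observations far from the bulk, so the few items whose scaling shifts are outlying (the DIF items) can be flagged. The function name, tuning constant, and toy data below are illustrative assumptions.

```python
import numpy as np

def tukey_biweight_location(x, c=4.685, tol=1e-8, max_iter=200):
    """Redescending M-estimate of location via IRLS (Tukey biweight).

    Observations with |residual / scale| >= c receive weight exactly
    zero, so a few aberrant values cannot drag the estimate."""
    mu = np.median(x)
    scale = np.median(np.abs(x - mu)) / 0.6745  # MAD-based scale
    if scale == 0:
        scale = 1.0
    for _ in range(max_iter):
        r = (x - mu) / scale
        w = np.where(np.abs(r) < c, (1 - (r / c) ** 2) ** 2, 0.0)
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    flagged = np.abs((x - mu) / scale) >= c  # zero-weight observations
    return mu, flagged

# Item-level scaling shifts: 8 consistent items plus 2 large outliers (DIF)
shifts = np.array([0.02, -0.05, 0.01, 0.04, -0.03, 0.00, 0.03, -0.01, 0.9, -1.1])
mu, flagged = tukey_biweight_location(shifts)
```

Here the two large shifts receive zero weight and are flagged, while the estimate of the common scaling shift is driven by the eight consistent items; c = 4.685 is the conventional 95%-efficiency tuning constant for the biweight.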
Affiliation(s)
- Peter F Halpin
- University of North Carolina at Chapel Hill, 100 E Cameron Ave, Office 1070G, Chapel Hill, NC, 27514, USA.
3
Shan N, Xu PF. Bayesian Adaptive Lasso for Detecting Item-Trait Relationship and Differential Item Functioning in Multidimensional Item Response Theory Models. PSYCHOMETRIKA 2024:10.1007/s11336-024-09998-x. [PMID: 39127801] [DOI: 10.1007/s11336-024-09998-x]
Abstract
In multidimensional tests, identifying the latent traits measured by each item is crucial. In addition to the item-trait relationship, differential item functioning (DIF) is routinely evaluated to ensure valid comparisons among different groups. The two problems have been investigated separately in the literature. This paper uses a unified framework for detecting the item-trait relationship and DIF in multidimensional item response theory (MIRT) models. By incorporating DIF effects in MIRT models, both problems can be treated as variable selection over latent/observed variables and their interactions. A Bayesian adaptive lasso procedure is developed for variable selection, in which the item-trait relationship and DIF effects are obtained simultaneously. Simulation studies demonstrate the performance of our method for parameter estimation, recovery of the item-trait relationship, and detection of DIF effects. An application is presented using data from the Eysenck Personality Questionnaire.
Affiliation(s)
- Na Shan
- School of Psychology & Key Laboratory of Applied Statistics of MOE, Northeast Normal University, 5268 Renmin Street, Changchun, Jilin, China.
- Ping-Feng Xu
- Academy for Advanced Interdisciplinary Studies & Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun, China
- Shanghai Zhangjiang Institute of Mathematics, Shanghai, China
4
Kraus EB, Wild J, Hilbert S. Using Interpretable Machine Learning for Differential Item Functioning Detection in Psychometric Tests. APPLIED PSYCHOLOGICAL MEASUREMENT 2024; 48:167-186. [PMID: 39055539] [PMCID: PMC11268249] [DOI: 10.1177/01466216241238744]
Abstract
This study presents a novel method combining psychometrics and machine learning to investigate test fairness and differential item functioning. Test unfairness manifests itself in systematic and demographically imbalanced influences of confounding constructs on residual variances in psychometric modeling. Our method aims to account for the resulting complex relationships between response patterns and demographic attributes. Specifically, it measures the importance of individual test items and latent ability scores, relative to a random baseline variable, when predicting demographic characteristics. We conducted a simulation study to examine the functionality of our method under various conditions, such as linear and complex impact and unfairness, and varying numbers of factors, unfair items, and test lengths. We found that our method detects unfair items as reliably as Mantel-Haenszel statistics or logistic regression analyses but generalizes to multidimensional scales in a straightforward manner. To apply the method, we used random forests to predict migration background from ability scores and single items of an elementary school reading comprehension test. One item was found to be unfair according to all proposed decision criteria. Further analysis of the item's content provided plausible explanations for this finding. Analysis code is available at: https://osf.io/s57rw/?view_only=47a3564028d64758982730c6d9c6c547.
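The core recipe (predict a demographic attribute from single items plus ability scores, then compare item importances against the clean items and a random baseline variable) can be sketched as follows. This is a schematic reconstruction on simulated data, not the authors' analysis code, which is available at the OSF link above; all simulation settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)            # demographic attribute to predict
theta = rng.normal(size=n)               # latent ability, independent of group

def simulate_item(difficulty, dif=0.0):
    """Rasch-type item; `dif` shifts difficulty for group 1 (an unfair item)."""
    logit = theta - difficulty - dif * group
    return (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

items = np.column_stack([simulate_item(-0.5), simulate_item(0.0),
                         simulate_item(0.5, dif=1.2), simulate_item(1.0)])
baseline = rng.normal(size=n)            # random reference variable
X = np.column_stack([items, theta, baseline])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, group)
importance = rf.feature_importances_
# Items whose importance clearly exceeds that of the fair items (and, in
# the paper's logic, the random baseline) are flagged as potentially
# unfair; here that is the third item, the only one simulated with DIF.
```

Because the fair items and the ability score carry no information about group membership in this simulation, only the DIF item should stand out in the importance profile.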
5
Wang C, Zhu R. Detecting uniform differential item functioning for continuous response computerized adaptive testing. APPLIED PSYCHOLOGICAL MEASUREMENT 2024; 48:18-37. [PMID: 38327608] [PMCID: PMC10846470] [DOI: 10.1177/01466216241227544]
Abstract
Evaluating items for potential differential item functioning (DIF) is an essential step in ensuring measurement fairness. In this article, we focus on a specific scenario: continuous response, severely sparse, computerized adaptive testing (CAT). Continuous response items are increasingly used in performance-based tasks because they tend to generate more information than traditional dichotomous items. Severe sparsity arises when many items are automatically generated via machine learning algorithms. We propose two uniform DIF detection methods for this scenario. The first is a modified version of CAT-SIBTEST, a non-parametric method that does not depend on any specific item response theory model assumptions. The second is a regularization method, a parametric, model-based approach. Simulation studies show that both methods are effective in correctly identifying items with uniform DIF. A real data analysis is provided at the end to illustrate the utility and potential caveats of the two methods.
6
Wallin G, Chen Y, Moustaki I. DIF Analysis with Unknown Groups and Anchor Items. PSYCHOMETRIKA 2024; 89:267-295. [PMID: 38383880] [PMCID: PMC11062998] [DOI: 10.1007/s11336-024-09948-7]
Abstract
Ensuring fairness in instruments like survey questionnaires or educational tests is crucial. One way to address this is through a differential item functioning (DIF) analysis, which examines whether different subgroups respond differently to a particular item after controlling for their overall latent construct level. DIF analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown but anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown but comparison groups are known, methods have also been proposed, typically under a sparsity assumption, i.e., that the number of DIF items is not too large. However, DIF analysis when both pieces of information are unknown has not received much attention. This paper proposes a general statistical framework for this setting. In the proposed framework, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an L1-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.
Affiliation(s)
- Gabriel Wallin
- Department of Mathematics and Statistics, Lancaster University, Umeå, Sweden
- Yunxiao Chen
- Department of Statistics, London School of Economics and Political Science, Columbia House, Room 5.16 Houghton Street, London, WC2A 2AE, UK.
- Irini Moustaki
- Department of Statistics, London School of Economics and Political Science, Columbia House, Room 5.16 Houghton Street, London, WC2A 2AE, UK
7
Chen Y, Li C, Ouyang J, Xu G. DIF Statistical Inference Without Knowing Anchoring Items. PSYCHOMETRIKA 2023; 88:1097-1122. [PMID: 37550561] [PMCID: PMC10656337] [DOI: 10.1007/s11336-023-09930-9]
Abstract
Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step in establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free in order to draw inferences on whether each of the remaining items exhibits DIF, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, or some anchor items are misspecified, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and p-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal [Formula: see text] norm condition for identifying the latent trait distributions. Without requiring prior knowledge of an anchor set, it can accurately estimate the DIF effects of individual items and draw valid statistical inferences to quantify the uncertainty. Specifically, the inference results allow us to control the type I error rate for DIF detection, which may not be possible with item purification and regularized estimation methods. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is applied to analysing the three personality scales of the Eysenck Personality Questionnaire-Revised (EPQ-R).
Affiliation(s)
- Yunxiao Chen
- London School of Economics and Political Science, London, UK.
8
Cole VT, Hussong AM, Gottfredson NC, Bauer DJ, Curran PJ. Informing Harmonization Decisions in Integrative Data Analysis: Exploring the Measurement Multiverse. PREVENTION SCIENCE 2023; 24:1595-1607. [PMID: 36441362] [DOI: 10.1007/s11121-022-01466-1]
Abstract
Combining datasets in an integrative data analysis (IDA) requires researchers to make a number of decisions about how best to harmonize item responses across datasets. This entails two sets of steps: logical harmonization, which involves combining items that appear similar across datasets, and analytic harmonization, which involves using psychometric models to find and account for cross-study differences in measurement. Embedded in logical and analytic harmonization are many decisions, from deciding whether items can be combined prima facie to how best to find covariate effects on specific items. Researchers may not have specific hypotheses about these decisions, and each individual choice may seem arbitrary, but their cumulative effects are unknown. In the current study, we conducted an IDA of the relationship between alcohol use and delinquency using three datasets (total N = 2245). For analytic harmonization, we used moderated nonlinear factor analysis (MNLFA) to generate factor scores for delinquency. We conducted both logical and analytic harmonization 72 times, each time making a different set of decisions. We assessed the cumulative influence of these decisions on MNLFA parameter estimates, factor scores, and estimates of the relationship between delinquency and alcohol use. There were differences across decision paths in MNLFA parameter estimates, but fewer differences in estimates of factor scores and of the regression parameters linking delinquency to alcohol use. These results suggest that factor scores may be relatively robust to subtly different decisions in data harmonization, whereas measurement model parameters are less so.
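The multiverse logic is easy to sketch generically: enumerate every combination of defensible harmonization decisions, then rerun the full pipeline down each path. The decision points below are placeholders, not the study's actual 72-path specification.

```python
from itertools import product

# Hypothetical harmonization decision points; the study's actual 72 paths
# arise from its own choices, which these placeholders do not reproduce.
decisions = {
    "combine_similar_items": ["strict", "lenient"],
    "covariate_effects": ["none", "mean_only", "mean_and_loading"],
    "estimator": ["ML", "MLR"],
    "linking_items": ["all", "subset"],
}

# One dict per path through the decision multiverse (2 * 3 * 2 * 2 = 24 here).
paths = [dict(zip(decisions, combo)) for combo in product(*decisions.values())]
```

Each path dict would parameterize one full logical-plus-analytic harmonization run, and the downstream estimates are then compared across all paths.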
Affiliation(s)
- Veronica T Cole
- Department of Psychology, Wake Forest University, 1834 Wake Forest Road, Winston-Salem, NC, 27109, USA.
- Andrea M Hussong
- Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Nisha C Gottfredson
- Department of Health Behavior, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Daniel J Bauer
- Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Patrick J Curran
- Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
9
Perlstein S, Wagner N, Domínguez-Álvarez B, Gómez-Fraguela JA, Romero E, Lopez-Romero L, Waller R. Psychometric Properties, Factor Structure, and Validity of the Sensitivity to Threat and Affiliative Reward Scale in Children and Adults. Assessment 2023; 30:1914-1934. [PMID: 36245403] [PMCID: PMC10687739] [DOI: 10.1177/10731911221128946]
Abstract
Callous-Unemotional (CU) traits identify children at high risk of antisocial behavior. A recent theoretical model proposed that CU traits arise from low sensitivity to threat and affiliation. To assess these dimensions, we developed the parent- and self-reported Sensitivity to Threat and Affiliative Reward Scale (STARS) and tested its psychometric properties, factor structure, and construct validity. Samples 1 (N = 303; age 3-10; United States) and 2 (N = 854; age 5-9; Spain) comprised children, and Sample 3 comprised 514 young adults (Mage = 19.89; United States). In Sample 1, differential item functioning and item response theory techniques were used to identify the best-performing items from a 64-item pool, resulting in 28 items that functioned equivalently across age and gender. Factor analysis indicated acceptable fit for the theorized two-factor structure, with separate threat and affiliation factors, in all three samples, which showed predictive validity in relation to CU traits in children and psychopathic traits in young adults.
10
Wang C, Zhu R, Xu G. Using Lasso and Adaptive Lasso to Identify DIF in Multidimensional 2PL Models. MULTIVARIATE BEHAVIORAL RESEARCH 2023; 58:387-407. [PMID: 35086405] [DOI: 10.1080/00273171.2021.1985950]
Abstract
Differential item functioning (DIF) analysis refers to procedures that evaluate whether an item's characteristics differ across groups of persons after controlling for overall differences in performance. DIF is routinely evaluated as a screening step to ensure items behave the same across groups. Currently, the majority of DIF studies focus predominantly on unidimensional IRT models, although multidimensional IRT (MIRT) models provide a powerful tool for enriching the information gained in modern assessment. In this study, we explore regularization methods for DIF detection in MIRT models and compare their performance to the classic likelihood ratio test. Regularization methods have recently emerged as a new family of methods for DIF detection due to their advantages: (1) they bypass the tedious iterative purification procedure often needed to identify anchor items, and (2) they can handle multiple covariates simultaneously. The specific regularization methods considered in the study are lasso with the expectation-maximization (EM) algorithm, lasso with the expectation-maximization-maximization (EMM) algorithm, and adaptive lasso with EM. Simulation results show that lasso EMM and adaptive lasso EM hold great promise when the sample size is large, and both outperform lasso EM. A real data example from the PROMIS depression and anxiety scales is presented at the end.
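A stripped-down, one-item illustration of the lasso idea (not the paper's lasso-EM/EMM machinery for MIRT): put an L1 penalty on a group-specific DIF shift in a logistic item model and fit by proximal gradient descent. Soft-thresholding can set the DIF effect exactly to zero, which is how regularization sidesteps anchor-item purification. All parameter values and the simulation below are illustrative assumptions.

```python
import numpy as np

def lasso_dif_item(theta, group, y, lam=0.02, lr=0.1, n_iter=2000):
    """Fit logit P(y=1) = a*theta + b + d*group with an L1 penalty on d.

    Proximal gradient: plain gradient steps on (a, b, d), then
    soft-thresholding on d, which can zero it out exactly."""
    a, b, d = 1.0, 0.0, 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(a * theta + b + d * group)))
        r = p - y                                     # gradient of mean log-loss
        a -= lr * np.mean(r * theta)
        b -= lr * np.mean(r)
        d -= lr * np.mean(r * group)
        d = np.sign(d) * max(abs(d) - lr * lam, 0.0)  # soft-threshold on d only
    return a, b, d

rng = np.random.default_rng(1)
n = 5000
theta = rng.normal(size=n)
group = rng.integers(0, 2, n)

def simulate(d_true):
    p = 1 / (1 + np.exp(-(1.0 * theta + 0.2 + d_true * group)))
    return (rng.random(n) < p).astype(float)

_, _, d_dif = lasso_dif_item(theta, group, simulate(-1.0))   # DIF item
_, _, d_clean = lasso_dif_item(theta, group, simulate(0.0))  # clean item
```

For the clean item the penalized DIF effect lands exactly at zero, while for the DIF item it remains clearly negative (shrunk somewhat toward zero, the usual lasso bias).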
11
Leitgöb H, Seddig D, Asparouhov T, Behr D, Davidov E, De Roover K, Jak S, Meitinger K, Menold N, Muthén B, Rudnev M, Schmidt P, van de Schoot R. Measurement invariance in the social sciences: Historical development, methodological challenges, state of the art, and future perspectives. SOCIAL SCIENCE RESEARCH 2023; 110:102805. [PMID: 36796989] [DOI: 10.1016/j.ssresearch.2022.102805]
Abstract
This review summarizes the current state of the art of statistical and (survey) methodological research on measurement (non)invariance, which is considered a core challenge for the comparative social sciences. After outlining the historical roots, conceptual details, and standard procedures for measurement invariance testing, the paper focuses in particular on the statistical developments that have been achieved in the last 10 years. These include Bayesian approximate measurement invariance, the alignment method, measurement invariance testing within the multilevel modeling framework, mixture multigroup factor analysis, the measurement invariance explorer, and the response shift-true change decomposition approach. Furthermore, the contribution of survey methodological research to the construction of invariant measurement instruments is explicitly addressed and highlighted, including the issues of design decisions, pretesting, scale adoption, and translation. The paper ends with an outlook on future research perspectives.
Affiliation(s)
- Heinz Leitgöb
- University of Leipzig, Germany; University of Frankfurt, Germany.
- Daniel Seddig
- University of Cologne, Germany; University of Münster, Germany
- Dorothée Behr
- GESIS - Leibniz Institute for the Social Sciences, Germany
- Eldad Davidov
- University of Cologne, Germany; University of Zurich and URPP Social Networks, Switzerland
- Kim De Roover
- Tilburg University, the Netherlands; KU Leuven, Belgium
- Peter Schmidt
- University of Giessen, Germany; University of Mainz, Germany
12
Regularized Mixture Rasch Model. INFORMATION 2022. [DOI: 10.3390/info13110534]
Abstract
The mixture Rasch model is a popular mixture model for analyzing multivariate binary data. The drawback of this model is that the number of estimated parameters substantially increases with an increasing number of latent classes, which, in turn, hinders the interpretability of model parameters. This article proposes regularized estimation of the mixture Rasch model that imposes some sparsity structure on class-specific item difficulties. We illustrate the feasibility of the proposed modeling approach by means of one simulation study and two simulated case studies.
13
A Machine Learning Approach to Assess Differential Item Functioning of the KINDL Quality of Life Questionnaire Across Children with and Without ADHD. Child Psychiatry Hum Dev 2022; 53:980-991. [PMID: 33963488] [DOI: 10.1007/s10578-021-01179-6]
Abstract
This study aimed to investigate differential item functioning (DIF) of the child and parent reports of the KINDL measure across children with and without Attention-deficit/hyperactivity disorder (ADHD). The sample included 122 children with ADHD and 1086 healthy peers, alongside 127 and 1061 of their parents, respectively. The generalized partial credit model with lasso penalization, as a machine learning method, was used to assess DIF of the KINDL across the two groups. The findings showed that three out of 24 items of the child reports and seven out of 24 items of the parent reports of the KINDL exhibited DIF between children with and without ADHD. Accordingly, Iranian children with and without ADHD along with their parents perceive almost all items in the KINDL similarly. Hence, the observed difference in quality of life scores between children with and without ADHD is a real difference and not a reflection of measurement bias.
14
Schirmbeck K, Runge R, Rao N, Wang R, Richards B, Chan SWY, Maehler C. Assessing executive functions in preschoolers in Germany and Hong Kong: testing for measurement invariance. JOURNAL OF CULTURAL COGNITIVE SCIENCE 2022. [DOI: 10.1007/s41809-022-00112-0]
15
Stevens AK, Janssen T, Belzak WC, Padovano HT, Jackson KM. Comprehensive measurement invariance of alcohol outcome expectancies among adolescents using regularized moderated nonlinear factor analysis. Addict Behav 2022; 124:107088. [PMID: 34487979] [PMCID: PMC8805203] [DOI: 10.1016/j.addbeh.2021.107088]
Abstract
Alcohol outcome expectancies (AOEs) are robust predictors of alcohol initiation and escalation of drinking behavior among adolescents. Although measurement invariance is a prerequisite for valid comparisons of AOEs across groups (e.g., age), empirical evidence is lacking. In a secondary data analysis study, we employed regularized moderated nonlinear factor analysis (MNLFA) to simultaneously test differential item functioning (DIF) across age, sex, race, ethnicity, socioeconomic status (SES), and alcohol initiation for a 22-item, two-factor measure of positive and negative AOEs among adolescents (analytic n = 936, drawn from a parent study of 1023 adolescents). Evidence of DIF was minimal, with no DIF for the negative AOE factor and DIF for only two items of the positive AOE factor. The item "feel grown up" exhibited DIF by age, and the item "feel romantic" exhibited DIF by SES. After accounting for DIF, the positive AOE latent factor mean differed by SES, age, and alcohol initiation, and exhibited lower variability by alcohol initiation. The negative AOE latent factor mean differed by sex and SES, with greater variability by SES and age and lower variability by alcohol initiation. The group-difference findings for age and alcohol initiation are consistent with prior work, and the differences by sex and SES are a new contribution to the literature that should prompt additional research to ensure replicability. The present study demonstrates the utility of the MNLFA technique for examining comprehensive measurement invariance, particularly for applied researchers who seek to examine substantive research questions while accounting for any DIF present in the scales used.
Affiliation(s)
- Angela K. Stevens
- Center for Alcohol and Addiction Studies, Brown University School of Public Health, Providence, RI 02903, USA. Corresponding author: Center for Alcohol and Addiction Studies, Brown University, Box G-S121-4, Providence, RI 02912, USA (A.K. Stevens)
- Tim Janssen
- Center for Alcohol and Addiction Studies, Brown University School of Public Health, Providence, RI 02903, USA
- William C.M. Belzak
- Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hayley Treloar Padovano
- Center for Alcohol and Addiction Studies, Brown University School of Public Health, Providence, RI 02903, USA
- Kristina M. Jackson
- Center for Alcohol and Addiction Studies, Brown University School of Public Health, Providence, RI 02903, USA
16
A Machine Learning Approach to Assess Differential Item Functioning in Psychometric Questionnaires Using the Elastic Net Regularized Ordinal Logistic Regression in Small Sample Size Groups. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6854477. [PMID: 34957307] [PMCID: PMC8695002] [DOI: 10.1155/2021/6854477]
Abstract
Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model depends heavily on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased in small samples. This study proposes a new application of the elastic net regularized OLR model, a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the power and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions, including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, compared with the nonregularized OLR model, increased by approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model, especially in extremely small sample size groups. Furthermore, the present research provides a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.
17
Somaraju AV, Nye CD, Olenick J. A Review of Measurement Equivalence in Organizational Research: What's Old, What's New, What's Next? ORGANIZATIONAL RESEARCH METHODS 2021. [DOI: 10.1177/10944281211056524]
Abstract
The study of measurement equivalence has important implications for organizational research. Nonequivalence across groups or over time can affect the results of a study and the conclusions that are drawn from it. As a result, the review paper by Vandenberg & Lance (2000) has been highly cited and has played an important role in understanding the measurement of organizational constructs. However, that paper is now 20 years old, and a number of advances have been made in the application and interpretation of measurement equivalence (ME) since its publication. Therefore, the goal of the present paper is to provide an updated review of ME techniques that describes recent advances in testing for ME and proposes a taxonomy of potential sources of nonequivalence. Finally, we articulate recommendations for applying these newer methods and consider future directions for measurement equivalence research in the organizational literature.
Collapse
|
18
|
Wang M, Reeve BB. Evaluations of the sum-score-based and item response theory-based tests of group mean differences under various simulation conditions. Stat Methods Med Res 2021; 30:2604-2618. [PMID: 34617840 PMCID: PMC8649417 DOI: 10.1177/09622802211043263] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022]
Abstract
The use of patient-reported outcomes measures is gaining popularity in clinical
trials for comparing patient groups. Such comparisons typically focus on the
differences in group means and are carried out using either a traditional
sum-score-based approach or item response theory (IRT)-based approaches. Several
simulation studies have evaluated different group mean comparison approaches in
the past, but the performance of these approaches remained unknown under certain
uninvestigated conditions (e.g. under the impact of differential item
functioning (DIF)). By incorporating some of the uninvestigated simulation
features, the current study examines Type I error, statistical power, and effect
size estimation accuracy associated with group mean comparisons using simple sum
scores, IRT model likelihood ratio tests, and IRT expected-a-posteriori scores.
Manipulated features include sample size per group, number of items, number of
response categories, strength of discrimination parameters, location of
thresholds, impact of DIF, and presence of missing data. Results are summarized
and visualized using decision trees.
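As a minimal illustration of the first of the compared approaches (the sum-score-based comparison; the IRT likelihood ratio and expected-a-posteriori approaches are omitted here), each respondent's item responses are summed and the group means compared with a two-sample t-test. The data below are simulated and purely illustrative.

```python
# Hedged sketch of a sum-score-based group mean comparison
# (simulated 5-category responses; not the authors' simulation design).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_items = 200, 10
ref = rng.integers(0, 5, size=(n_per_group, n_items))          # reference group
# focal group with a small upward shift in responses
foc = np.clip(ref + rng.binomial(1, 0.3, size=ref.shape), 0, 4)
sum_ref, sum_foc = ref.sum(axis=1), foc.sum(axis=1)
t, p = stats.ttest_ind(sum_ref, sum_foc)
print(f"t = {t:.2f}, p = {p:.4f}")
```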
Collapse
Affiliation(s)
- Mian Wang
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Carrboro, NC, USA
| | - Bryce B Reeve
- Department of Population Health Sciences, Duke University School of Medicine, Durham, NC, USA
| |
Collapse
|
19
|
The net worth of networks and extraversion: Examining personality structure through network models. PERSONALITY AND INDIVIDUAL DIFFERENCES 2021. [DOI: 10.1016/j.paid.2021.111039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
20
|
Yuan KH, Liu H, Han Y. Differential Item Functioning Analysis Without A Priori Information on Anchor Items: QQ Plots and Graphical Test. PSYCHOMETRIKA 2021; 86:345-377. [PMID: 33656627 DOI: 10.1007/s11336-021-09746-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/23/2019] [Revised: 12/30/2020] [Accepted: 01/08/2021] [Indexed: 06/12/2023]
Abstract
Differential item functioning (DIF) analysis is an important step in establishing the validity of measurements. Most traditional methods for DIF analysis use an item-by-item strategy via anchor items that are assumed DIF-free. If anchor items are flawed, these methods will yield misleading results due to biased scales. In this article, based on the fact that an item's relative change of difficulty difference (RCD) does not depend on the mean ability of individual groups, a new DIF detection method (RCD-DIF) is proposed that compares the observed differences against those in simulated data known to be DIF-free. The RCD-DIF method consists of a D-QQ (quantile-quantile) plot that permits the identification of internal reference points (similar to anchor items), an RCD-QQ plot that facilitates visual examination of DIF, and an RCD graphical test that synchronizes DIF analysis at the test level with that at the item level via confidence intervals on individual items. The RCD procedure visually reveals the overall pattern of DIF in the test and the size of DIF for each item, and it is expected to work properly even when the majority of the items possess DIF and the DIF pattern is unbalanced. Results of two simulation studies indicate that the RCD graphical test has Type I error rates comparable to those of existing methods but greater power.
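A rough numerical sketch of the centering-and-comparison idea follows. This is not the authors' exact RCD procedure: the noise model, the median centering, and the flagging cutoff are all simplifying assumptions made for illustration.

```python
# Hedged sketch of the RCD idea: difficulty differences between groups are
# centered so the group mean-ability offset cancels, then compared against
# simulated DIF-free differences (all numbers illustrative).
import numpy as np

rng = np.random.default_rng(2)
n_items = 40
b_ref = rng.normal(0, 1, n_items)              # reference-group difficulties
impact = 0.5                                   # true group mean difference
b_foc = b_ref + impact + rng.normal(0, 0.05, n_items)
b_foc[:8] += 0.6                               # items 0-7 carry uniform DIF

diff = b_foc - b_ref
rcd = diff - np.median(diff)                   # relative change of difficulty difference

# simulated DIF-free reference distribution for the QQ comparison
null = rng.normal(0, 0.05, (1000, n_items))
null_rcd = null - np.median(null, axis=1, keepdims=True)
q = np.linspace(0.05, 0.95, 19)
obs_q, null_q = np.quantile(rcd, q), np.quantile(null_rcd.ravel(), q)
# the pairs (null_q, obs_q) are what an RCD-QQ plot would display;
# a crude cutoff flags items whose |RCD| exceeds the null 99th percentile
flagged = np.where(np.abs(rcd) > np.quantile(np.abs(null_rcd), 0.99))[0]
print("flagged items:", flagged)
```

Because the centering removes the mean-ability offset, the DIF-bearing items stand out against the simulated DIF-free quantiles even without pre-specified anchor items.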
Collapse
Affiliation(s)
- Ke-Hai Yuan
- Department of Psychology, University of Notre Dame, Notre Dame, IN, 46556, USA
| | - Hongyun Liu
- Faculty of Psychology, Beijing Normal University, No. 19, XinJieKouWai St., HaiDian District, Beijing, 100875, People's Republic of China
| | - Yuting Han
- Faculty of Psychology, Beijing Normal University, No. 19, XinJieKouWai St., HaiDian District, Beijing, 100875, People's Republic of China
| |
Collapse
|
21
|
Petersen IT, Choe DE, LeBeau B. Studying a Moving Target in Development: The Challenge and Opportunity of Heterotypic Continuity. DEVELOPMENTAL REVIEW 2020; 58:100935. [PMID: 33244192 PMCID: PMC7685252 DOI: 10.1016/j.dr.2020.100935] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
Many psychological constructs show heterotypic continuity: their behavioral manifestations change with development, but their meaning remains the same (e.g., externalizing problems). However, research has paid little attention to how to account for heterotypic continuity. Conceptual and methodological challenges of heterotypic continuity may prevent researchers from examining lengthy developmental spans. Developmental theory requires that measurement accommodate changes in the manifestation of constructs. Simulation and empirical work demonstrate that failure to account for heterotypic continuity when collecting or analyzing longitudinal data results in faulty developmental inferences. Accounting for heterotypic continuity may require using different measures across time, with approaches that link those measures on a comparable scale. Creating a developmental scale (i.e., developmental scaling) is recommended to link measures across time and account for heterotypic continuity, which is crucial for understanding development across the lifespan. This synthesized review defines heterotypic continuity, describes how to identify it, and presents solutions for accounting for it. We note the challenges of addressing heterotypic continuity and propose steps for leveraging the opportunities it creates to advance the empirical study of development.
Collapse
|
22
|
Abstract
The comparison of group means in latent variable models plays a vital role in empirical research in the social sciences. The present article discusses an extension of invariance alignment and Haberman linking by choosing the robust power loss function ρ(x) = |x|^p (p > 0). This power loss function with power values p smaller than one is particularly suited for item responses that are generated under partial invariance. For a general class of linking functions, asymptotic normality of estimates is shown. Moreover, the theory of M-estimation is applied to obtain linking errors (i.e., inference with respect to a population of items) for this class of linking functions. In a simulation study, it is shown that invariance alignment and Haberman linking have comparable performance, and in some conditions, the newly proposed robust Haberman linking outperforms invariance alignment. In three examples, the influence of the choice of a particular linking function on the estimation of group means is demonstrated. It is concluded that the choice of the loss function in linking is related to structural assumptions about the pattern of noninvariance in item parameters.
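A toy sketch of why p < 1 helps under partial invariance: the linking constant c (here standing in for the group mean difference) minimizes Σ_i |d_i - c|^p over item-difficulty differences d_i. The grid search below is purely illustrative, not the estimation algorithm of the article; with p = 2 the estimate is pulled toward the noninvariant items, while a small p ignores them.

```python
# Hedged sketch: robust power-loss linking rho(x) = |x|^p via grid search
# (illustrative data and method, not the article's algorithm).
import numpy as np

rng = np.random.default_rng(3)
d = np.full(20, 0.5)            # true group difference of 0.5 on all items
d[:4] += 1.0                    # partial invariance: 4 noninvariant items
d += rng.normal(0, 0.02, 20)    # small estimation noise

def link_constant(d, p):
    """Minimize sum_i |d_i - c|^p over a dense grid of candidate constants."""
    grid = np.linspace(d.min(), d.max(), 2001)
    loss = (np.abs(d[None, :] - grid[:, None]) ** p).sum(axis=1)
    return grid[np.argmin(loss)]

print("p = 2 (mean-like):", round(link_constant(d, 2.0), 2))   # pulled upward by DIF items
print("p = 0.5 (robust): ", round(link_constant(d, 0.5), 2))   # stays near 0.5
```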
Collapse
|
23
|
Abstract
The comparison of group means in item response models constitutes an important issue in empirical research. The present article discusses a slight extension of the robust Haebara linking approach of He and Cui by proposing a flexible class of robust Haebara linking functions for comparisons of many groups. These robust linking functions are robust against violations of invariance. In this article, we investigate the performance of robust Haebara linking in the presence of uniform DIF effects. In an analytical derivation, it is shown that the robust Haebara linking approach provides unbiased estimates of group means in the limiting case p=0. In a simulation study, it is demonstrated that the proposed variant of the Haebara linking approach outperforms existing implementations of Haebara linking to some extent. In an empirical application using PISA data, it is illustrated that country means can be sensitive to the choice of linking functions.
Collapse
|
24
|
Addressing missing data in specification search in measurement invariance testing with Likert-type scale variables: A comparison of two approaches. Behav Res Methods 2020; 52:2567-2587. [PMID: 32495029 DOI: 10.3758/s13428-020-01415-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
In measurement invariance testing, when a certain level of full invariance is not achieved, the sequential backward specification search method with the largest modification index (SBSS_LMFI) is often used to identify the source of non-invariance. SBSS_LMFI has been studied under complete data but not missing data. Focusing on Likert-type scale variables, this study examined two methods for dealing with missing data in SBSS_LMFI using Monte Carlo simulation: the robust full information maximum likelihood estimator (rFIML) and the mean- and variance-adjusted weighted least squares estimator coupled with pairwise deletion (WLSMV_PD). The results suggest that WLSMV_PD can result not only in over-rejection of invariance models but also in reduced power to identify non-invariant items. In contrast, rFIML provided good control of Type I error rates, although it required a larger sample size to yield sufficient power to identify non-invariant items. Recommendations based on the results are provided.
Collapse
|