1. Jović M, Amir-Haeri M, Rimfeld K, Ensink JBM, Lindauer RJL, Vrijkotte TGM, Whitehouse A, van den Berg SM. Harmonization of SDQ and ASEBA Phenotypes: Measurement Variance Across Cohorts. Journal of Psychopathology and Behavioral Assessment 2025; 47:27. [PMID: 40062209] [PMCID: PMC11889055] [DOI: 10.1007/s10862-025-10204-0]
Abstract
Harmonizing the scores obtained by different instruments that measure the same construct enables researchers to combine them in a single analysis. An important step in harmonization is checking whether there is measurement invariance across populations. This study examined whether harmonized scores for anxiety/depression and ADHD obtained with two different instruments, the Child Behavior Checklist (CBCL) and the Strengths and Difficulties Questionnaire (SDQ), are measurement invariant across countries, languages, and age groups. We used cohorts from Australia (1330 children aged 10-11.5 years), the Netherlands (943 children aged 11-13.5 years), and the United Kingdom (4504 children aged 14-19 years). We used the Bayesian method for modeling measurement non-invariance proposed by Verhagen and Fox (2013a), which we adapted for use with polytomous items and a relatively small number of groups (cohorts). Results showed hardly any differential functioning of the harmonized anxiety/depression and ADHD scores obtained with the CBCL and SDQ across cohorts. The same model that harmonizes measures in Australian 10-year-old children can also be used in the cohorts from the UK and the Netherlands. Supplementary information: the online version contains supplementary material available at 10.1007/s10862-025-10204-0.
Affiliation(s)
- Miljan Jović: Department of Learning, Data Analytics and Technology, Faculty of Behavioural, Management and Social Sciences, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands
- Maryam Amir-Haeri: Department of Learning, Data Analytics and Technology, Faculty of Behavioural, Management and Social Sciences, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands
- Kaili Rimfeld: Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, UK; Royal Holloway University of London, London, UK
- Judith B. M. Ensink: Department of Child and Adolescent Psychiatry, Amsterdam University Medical Center, Location AMC, Amsterdam, The Netherlands; Genome Diagnostics Laboratory, Department of Clinical Genetics, Amsterdam University Medical Center, Location AMC, Amsterdam, The Netherlands; Academic Centre for Child and Adolescent Psychiatry, Amsterdam, The Netherlands
- Ramon J. L. Lindauer: Department of Child and Adolescent Psychiatry, Amsterdam University Medical Center, Location AMC, Amsterdam, The Netherlands; Academic Centre for Child and Adolescent Psychiatry, Amsterdam, The Netherlands
- Tanja G. M. Vrijkotte: Department of Public and Occupational Health, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands; Amsterdam Public Health Research Institute, Amsterdam, The Netherlands
- Andrew Whitehouse: Telethon Kids Institute, University of Western Australia, Perth, Australia
- Stéphanie M. van den Berg: Department of Learning, Data Analytics and Technology, Faculty of Behavioural, Management and Social Sciences, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands

2. Buchanan EM. Visualizemi: Visualization, Effect Size, and Replication of Measurement Invariance for Registered Reports. Assessment 2025; 32:190-205. [PMID: 39473061] [DOI: 10.1177/10731911241280763]
Abstract
Latent variable modeling, as a lens for psychometric theory, is a popular tool for social scientists examining the measurement of constructs. Journals such as Assessment regularly publish articles supporting measures of latent constructs wherein a measurement model is established. Confirmatory factor analysis can be used to investigate the replicability and generalizability of the measurement model in new samples, while multigroup confirmatory factor analysis is used to examine the measurement model across groups within samples. With the rise of the replication crisis and "psychology's renaissance," interest in divergence in measurement has increased, often focused on small parameter differences within the latent model. This article presents visualizemi, an R package that provides functionality to calculate multigroup models and partial invariance, visualizations of (non-)invariance, effect sizes for models and parameters, and potential replication rates compared with random models. Readers will learn how to interpret the impact and size of the proposed non-invariance in models with a focus on potential replication, and how to plan for registered reports.

3. Ozcan M, Lai MHC. Exploring the Impact of Deleting (or Retaining) a Biased Item: A Procedure Based on Classification Accuracy. Assessment 2024:10731911241298081. [PMID: 39655755] [DOI: 10.1177/10731911241298081]
Abstract
Psychological test scores are commonly used in high-stakes settings to classify individuals. While measurement invariance across groups is necessary for valid and meaningful inferences of group differences, full measurement invariance rarely holds in practice. The classification accuracy analysis framework aims to quantify the degree and practical impact of noninvariance. However, how to best navigate the next steps remains unclear, and methods devised to account for noninvariance at the group level may be insufficient when the goal is classification. Furthermore, deleting a biased item may improve fairness but negatively affect performance, and replacing the test can be costly. We propose item-level effect size indices that allow test users to make more informed decisions by quantifying the impact of deleting (or retaining) an item on test performance and fairness, provide an illustrative example, and introduce unbiasr, an R package implementing the proposed methods.
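The item-deletion trade-off the abstract describes can be made concrete with a small simulation. The sketch below is illustrative only and is not the unbiasr implementation; the 2PL item parameters, the DIF shift on the last item, the sum-score cutoff, and the θ ≥ 0 "true" classification rule are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 2PL parameters for a 5-item screener (a = discrimination, b = difficulty).
a = np.array([1.2, 1.0, 1.5, 0.8, 1.1])
b = np.array([-0.5, 0.0, 0.3, -0.2, 0.4])
dif_shift = 0.6  # item 5 (index 4) is harder for the focal group: uniform DIF

def simulate(theta, focal):
    bb = b.copy()
    if focal:
        bb[4] += dif_shift                         # the biased item
    p = 1 / (1 + np.exp(-a * (theta[:, None] - bb)))
    return (rng.random(p.shape) < p).astype(int)

n = 20000
theta_r = rng.normal(0, 1, n)                      # reference group
theta_f = rng.normal(0, 1, n)                      # focal group: same trait distribution
x_r, x_f = simulate(theta_r, False), simulate(theta_f, True)

def accuracy(x, theta, keep):
    observed = x[:, keep].sum(axis=1) >= len(keep) // 2 + 1  # sum-score cutoff
    truth = theta >= 0                                       # simulated "true" status
    return (observed == truth).mean()

all_items = np.arange(5)
no_biased = np.array([0, 1, 2, 3])                 # drop the DIF item

for name, keep in [("all items", all_items), ("biased item deleted", no_biased)]:
    print(f"{name:22s} ref={accuracy(x_r, theta_r, keep):.3f} "
          f"focal={accuracy(x_f, theta_f, keep):.3f}")
```

Comparing the group-wise accuracies with and without the flagged item is the kind of fairness-versus-performance evidence the proposed effect size indices are meant to summarize.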
Affiliation(s)
- Meltem Ozcan: University of Southern California, Los Angeles, USA
- Mark H C Lai: University of Southern California, Los Angeles, USA

4. Oeffinger DJ, Iwinski H, Talwalkar V, Dueber DM. Psychometric analysis and the implications for the use of the Scoliosis Research Society questionnaire (SRS-22r English) for individuals with adolescent idiopathic scoliosis. North American Spine Society Journal 2024; 19:100545. [PMID: 39290847] [PMCID: PMC11405851] [DOI: 10.1016/j.xnsj.2024.100545]
Abstract
Background: Despite widespread usage of the SRS-22r questionnaire (Scoliosis Research Society Questionnaire-22r), the English version has only sparingly been subjected to analysis using modern psychometric techniques for patients with adolescent idiopathic scoliosis (AIS). The study purpose was to improve interpretation and clinical utility of the SRS-22r for adolescents with AIS by generating additional robust evidence using modern statistical techniques. Questions about (1) structure and (2) item and scale functioning are addressed and interpreted for clinicians and researchers. Methods: This retrospective case review analyzed SRS-22r data collected from 1823 patients (mean age 14.9 ± 2.2 years) with a primary diagnosis of AIS who completed an SRS-22r questionnaire in clinic. Individual SRS-22r questions and domain scores were retrieved through data queries. Patient information collected through chart review included diagnosis, age at assessment, sex, race, and radiographic parameters. From 6044 SRS-22r assessments, one assessment per patient was randomly selected. Exploratory structural equation modeling (ESEM) and item response theory (IRT) techniques were used for data modeling, item calibration, and reliability assessment. Results: ESEM demonstrated acceptable fit to the data: χ2(130) = 343.73, p < .001; RMSEA = 0.035; CFI = 0.98; TLI = 0.96; SRMR = 0.02. Several items failed to adequately load onto their assigned factor. Item fit was adequate for all items except SRSq10 (Self-Image), SRSq16 (Mental Health), and SRSq20 (Mental Health). IRT models found that item discriminations were within normal levels for items in psychological measures, except items SRSq1 (Pain), SRSq2 (Pain), and SRSq16 (Mental Health). Estimated reliability of the Function domain (ρ = 0.69) was low; however, the Pain, Self-Image, and Mental Health domains exhibited high (ρ > 0.80) reliability. Conclusions: Modern psychometric assessments of the SRS-22r in adolescent patients with AIS are presented and interpreted to assist clinicians and researchers in understanding its strengths and limitations. Overall, the SRS-22r demonstrated good psychometric properties in all domains except Function. Cautious interpretation of the total score is suggested, as it does not reflect a single HRQoL construct.
Affiliation(s)
- Donna J Oeffinger: Shriners Children's Lexington, 110 Conn Terrace, Lexington, KY 40508, United States
- Henry Iwinski: Shriners Children's Lexington, 110 Conn Terrace, Lexington, KY 40508, United States
- Vishwas Talwalkar: Shriners Children's Lexington, 110 Conn Terrace, Lexington, KY 40508, United States
- David M Dueber: The Herb Innovation Center, University of Toledo, 3100 Gillham Hall, Toledo, OH 43606, United States

5. Widaman KF, Revelle W. Thinking About Sum Scores Yet Again, Maybe the Last Time, We Don't Know, Oh No . . .: A Comment on McNeish (2023). Educational and Psychological Measurement 2024; 84:637-659. [PMID: 39055096] [PMCID: PMC11268387] [DOI: 10.1177/00131644231205310]
Abstract
The relative advantages and disadvantages of sum scores and estimated factor scores are issues of concern for substantive research in psychology. Recently, while championing estimated factor scores over sum scores, McNeish offered a trenchant rejoinder to an article by Widaman and Revelle, which had critiqued an earlier paper by McNeish and Wolf. In the recent contribution, McNeish misrepresented a number of claims by Widaman and Revelle, rendering moot his criticisms of Widaman and Revelle. Notably, McNeish chose to avoid confronting a key strength of sum scores stressed by Widaman and Revelle: the greater comparability of results across studies if sum scores are used. Instead, McNeish pivoted to present a host of simulation studies to identify relative strengths of estimated factor scores. Here, we review our prior claims and, in the process, deflect purported criticisms by McNeish. We briefly discuss issues related to simulated data and empirical data that provide evidence of the strengths of each type of score. In doing so, we identified a second strength of sum scores: superior cross-validation of results across independent samples of empirical data, at least for samples of moderate size. We close with consideration of four general issues concerning sum scores and estimated factor scores that highlight the contrasts between the positions offered by McNeish and by us, issues of importance when pursuing applied research in our field.

6. Black L, Humphrey N, Panayiotou M, Marquez J. Mental Health and Well-being Measures for Mean Comparison and Screening in Adolescents: An Assessment of Unidimensionality and Sex and Age Measurement Invariance. Assessment 2024; 31:219-236. [PMID: 36864693] [PMCID: PMC10822075] [DOI: 10.1177/10731911231158623]
Abstract
Adolescence is a period of increased vulnerability for low well-being and mental health problems, particularly for girls and older adolescents. Accurate measurement via brief self-report is therefore vital to understanding prevalence, group trends, screening efforts, and response to intervention. We drew on data from the #BeeWell study (N = 37,149, aged 12-15) to consider whether sum-scoring, mean comparisons, and deployment for screening were likely to show bias for eight such measures. Evidence for unidimensionality, considering dynamic fit confirmatory factor models, exploratory graph analysis, and bifactor modeling, was found for five measures. Of these five, most showed a degree of non-invariance across sex and age likely incompatible with mean comparison. Effects on selection were minimal, except sensitivity was substantially lower in boys for the internalizing symptoms measure. Measure-specific insights are discussed, as are general issues highlighted by our analysis, such as item reversals and measurement invariance.

7. DeCarlo M, Bean G. Assessing Measurement Invariance in ASWB Exams: Regulatory Research Proposal to Advance Equity. Journal of Evidence-Based Social Work (2019) 2024; 21:214-235. [PMID: 38345106] [DOI: 10.1080/26408066.2024.2308814]
Abstract
PURPOSE: Social workers from minoritized racial, ethnic, linguistic, and age groups are far less likely to pass the licensing examinations required to practice. Using a simulated data set, our study investigates measurement equivalence, or invariance, of social work licensing exams. MATERIALS: For this analysis, we simulated responses to 15 multiple-choice questions, scored as either correct or incorrect, using the R mirt package, and used mirt to fit a two-parameter logistic (2PL) model to the response data. We generated the data so that five items could demonstrate DIF and calculated their impact on the test characteristic curves and item characteristic curves. RESULTS: Small amounts of differential item functioning added up into differential test functioning, but the effect size was small. This result is one potential outcome of an analysis of ASWB exams. DISCUSSION: Most studies evaluating test characteristic curves demonstrate small effect sizes. Measuring the test characteristic curve and the test information curve will help to investigate content-irrelevant sources of variance in the exams, including unfairness, unreliability, and invalid pass scores. CONCLUSION: Differential test functioning is a core part of measurement invariance studies. Psychometric standards require test developers to assess measurement invariance at both the item level and the test level to protect themselves from accusations of bias.
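For readers who want to see the mechanics, here is a rough Python analogue of the simulation design described above (the original analysis used the R mirt package). The parameter values, the 0.4 DIF shift on five items, and the standard-normal weighting are placeholders chosen for illustration.

```python
import numpy as np

# Hypothetical parameters for a 15-item dichotomous test (2PL), echoing the
# abstract's design: five items carry uniform DIF against the focal group.
rng = np.random.default_rng(42)
a = rng.uniform(0.8, 2.0, 15)            # discriminations
b_ref = rng.normal(0.0, 1.0, 15)         # reference-group difficulties
b_foc = b_ref.copy()
b_foc[:5] += 0.4                          # small uniform DIF in five items

def tcc(theta, a, b):
    """Test characteristic curve: expected total score at each theta."""
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
    return p.sum(axis=1)

theta = np.linspace(-3, 3, 121)
dtf = tcc(theta, a, b_foc) - tcc(theta, a, b_ref)   # differential test functioning

# Signed DTF averaged over a standard-normal trait density, as one effect size.
w = np.exp(-theta**2 / 2)
w /= w.sum()
print(f"max |DTF| = {np.abs(dtf).max():.3f} points; "
      f"weighted signed DTF = {(dtf * w).sum():.3f} points")
```

Several small item-level shifts in the same direction accumulate in the test characteristic curve, which is exactly the item-to-test aggregation the abstract highlights.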
Affiliation(s)
- Matthew DeCarlo: College of Education and Human Development, Saint Joseph's University, Philadelphia, Pennsylvania, USA
- Gerald Bean: College of Social Work, College of Education & Human Ecology, The Ohio State University, Columbus, Ohio, USA

8. Goldammer P, Annen H, Lienhard C, Jonas K. An examination of model fit and measurement invariance of general mental ability and personality measures used in the multilingual context of the Swiss Armed Forces: A Bayesian structural equation modeling approach. Military Psychology 2024; 36:96-113. [PMID: 38193872] [PMCID: PMC10790799] [DOI: 10.1080/08995605.2021.1963632]
Abstract
Measurement invariance of psychological test batteries is an essential quality criterion when the test batteries are administered in different cultural and language contexts. The purpose of this study was to examine to what extent measurement model fit and measurement invariance across the two largest language groups in Switzerland (i.e., German and French speakers) can be assumed for selected general mental ability and personality tests used in the Swiss Armed Forces' cadre selection process. For the model fit and invariance testing, we used Bayesian structural equation modeling (BSEM). Because the sizes of the language group samples were unbalanced, we reran the invariance testing with a subsampling procedure as a robustness check. The results showed that at least partial approximate scalar invariance can be assumed for the constructs. However, comparisons in the full sample and subsamples also showed that certain test items function differently across the language groups. The results are discussed with regard to three issues. First, we critically discuss the applied criterion and alternative effect size measures for assessing the practical importance of non-invariances. Second, we highlight potential remedies and further testing options that can be applied once certain items have been detected to function differently. Third, we discuss alternative modeling and invariance testing approaches to BSEM and outline future research avenues.
Affiliation(s)
- Philippe Goldammer: Department of Military Psychology and Pedagogics, Military Academy at ETH Zurich, Birmensdorf, Switzerland
- Hubert Annen: Department of Military Psychology and Pedagogics, Military Academy at ETH Zurich, Birmensdorf, Switzerland
- Klaus Jonas: Department of Psychology, University of Zurich, Zurich, Switzerland

9. Richson BN, Hazzard VM, Christensen KA, Hagan KE. Do the SCOFF items function differently by food-security status in U.S. college students? Statistically, but not practically, significant differences. Eat Behav 2023; 49:101743. [PMID: 37209568] [PMCID: PMC10681748] [DOI: 10.1016/j.eatbeh.2023.101743]
Abstract
Despite food insecurity (FI) being associated with eating disorders (EDs), little research has examined whether ED screening measures perform differently in individuals with FI. This study tested whether items on the SCOFF performed differently as a function of FI. Because many people with FI hold multiple marginalized identities, this study also tested whether the SCOFF performs differently as a function of food-security status in individuals with different gender identities and different perceived weight statuses. Data were from the 2020/2021 Healthy Minds Study (N = 122,269). Past-year FI was established using the two-item Hunger Vital Sign. Differential item functioning (DIF) assessed whether SCOFF items performed differently (i.e., had different probabilities of endorsement) in groups of individuals with versus without FI. Both uniform DIF (a constant between-group difference in item-endorsement probability across levels of ED pathology) and non-uniform DIF (a variable between-group difference in item-endorsement probability across levels of ED pathology) were examined. Several SCOFF items demonstrated both statistically significant uniform and non-uniform DIF (ps < .001), but no instance of DIF reached practical significance (indicated by effect sizes of pseudo ΔR² ≥ .035; all pseudo ΔR² ≤ .006). When stratifying by gender identity and weight status, although most items demonstrated statistically significant DIF, only the SCOFF item measuring body-size perception showed practically significant non-uniform DIF for perceived weight status. Findings suggest the SCOFF is an appropriate screening measure for ED pathology among college students with FI and provide preliminary support for using the SCOFF in individuals with FI and certain marginalized identities.
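The logistic-regression DIF procedure with McFadden's pseudo ΔR², which the abstract's ≥ .035 benchmark refers to, is compact enough to sketch. Everything below is simulated for illustration (a rest-score stand-in for ED pathology and an invented 0.3 uniform-DIF shift), not the Healthy Minds data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
group = rng.integers(0, 2, n)                 # 0 = food secure, 1 = food insecure
trait = rng.normal(0, 1, n)                   # stand-in for ED pathology (rest score)
# Hypothetical item with mild uniform DIF: food-insecure respondents endorse
# the item slightly more often at the same trait level.
logit = 1.3 * trait - 0.5 + 0.3 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def mcfadden(y, X):
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    return 1 - fit.llf / fit.llnull

r2_base = mcfadden(y, np.column_stack([trait]))                        # trait only
r2_unif = mcfadden(y, np.column_stack([trait, group]))                 # + group
r2_nonu = mcfadden(y, np.column_stack([trait, group, trait * group]))  # + interaction

print(f"uniform DIF pseudo dR2 = {r2_unif - r2_base:.4f}")
print(f"non-uniform DIF pseudo dR2 = {r2_nonu - r2_unif:.4f}")
# Values below the >= .035 benchmark cited in the abstract would be judged
# statistically detectable but practically negligible.
```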
Affiliation(s)
- Brianne N Richson: Department of Psychology, University of Kansas, Lawrence, KS, USA; Department of Psychiatry, University of California San Diego Eating Disorders Center for Treatment and Research, San Diego, CA, USA
- Vivienne M Hazzard: Division of Epidemiology & Community Health, University of Minnesota School of Public Health, Minneapolis, MN, USA
- Kara A Christensen: Department of Psychology, University of Nevada, Las Vegas, Las Vegas, NV, USA
- Kelsey E Hagan: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, USA; New York State Psychiatric Institute, New York, NY, USA

10. Chalmers RP. A Unified Comparison of IRT-Based Effect Sizes for DIF Investigations. Journal of Educational Measurement 2022. [DOI: 10.1111/jedm.12347]

11. Joo S, Lee P. Detecting Differential Item Functioning Using Posterior Predictive Model Checking: A Comparison of Discrepancy Statistics. Journal of Educational Measurement 2022. [DOI: 10.1111/jedm.12316]

12. Taple BJ, Chapman R, Schalet BD, Brower R, Griffith JW. The Impact of Education on Depression Assessment: Differential Item Functioning Analysis. Assessment 2022; 29:272-284. [PMID: 33218257] [PMCID: PMC9060911] [DOI: 10.1177/1073191120971357]
Abstract
A person's level of education can affect their access to health care and their health outcomes. Increasing rates of depression are another looming public health concern. Vulnerability is therefore compounded for individuals who have both a lower level of education and depression. Assessment of depressive symptoms is integral to many domains of health care, including primary care and mental health specialty care. This investigation examined the degree to which education influences the psychometric properties of self-report items that measure depressive symptoms. This study was a secondary data analysis of three large internet panel studies. Together, the studies included the Beck Depression Inventory-II, the Center for Epidemiologic Studies Depression Scale, the Patient Health Questionnaire, and the Patient-Reported Outcomes Measurement Information System measures of depression. Using a differential item functioning (DIF) approach, we found that some items on each of the questionnaires were flagged for DIF, with effect sizes ranging from McFadden's pseudo R² = .005 to .022. For example, several double-barreled questions were flagged for DIF. Overall, questionnaires assessing depression vary in level of complexity, which interacts with the respondent's level of education. Measurement of depression should include consideration of possible educational disparities, to identify people who may struggle with a written questionnaire or may be subject to subtle psychometric biases associated with education.
Affiliation(s)
- Bayley J Taple: Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Robert Chapman: Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Rylee Brower: Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- James W Griffith: Northwestern University Feinberg School of Medicine, Chicago, IL, USA

13. Tay L, Woo SE, Hickman L, Booth BM, D’Mello S. A Conceptual Framework for Investigating and Mitigating Machine-Learning Measurement Bias (MLMB) in Psychological Assessment. Advances in Methods and Practices in Psychological Science 2022. [DOI: 10.1177/25152459211061337]
Abstract
Given significant concerns about fairness and bias in the use of artificial intelligence (AI) and machine learning (ML) for psychological assessment, we provide a conceptual framework for investigating and mitigating machine-learning measurement bias (MLMB) from a psychometric perspective. MLMB is defined as differential functioning of the trained ML model between subgroups. MLMB manifests empirically when a trained ML model produces different predicted score levels for different subgroups (e.g., race, gender) despite them having the same ground-truth levels for the underlying construct of interest (e.g., personality) and/or when the model yields differential predictive accuracies across the subgroups. Because the development of ML models involves both data and algorithms, both biased data and algorithm-training bias are potential sources of MLMB. Data bias can occur in the form of nonequivalence between subgroups in the ground truth, platform-based construct, behavioral expression, and/or feature computing. Algorithm-training bias can occur when algorithms are developed with nonequivalence in the relation between extracted features and ground truth (i.e., algorithm features are differentially used, weighted, or transformed between subgroups). We explain how these potential sources of bias may manifest during ML model development and share initial ideas for mitigating them, including recognizing that new statistical and algorithmic procedures need to be developed. We also discuss how this framework clarifies MLMB but does not reduce the complexity of the issue.
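The abstract's definition of MLMB suggests a direct empirical check: condition on ground truth and compare predicted-score levels and predictive accuracies by subgroup. A toy sketch under invented data assumptions (not a procedure from the paper itself):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
group = rng.integers(0, 2, n)             # two hypothetical demographic subgroups
truth = rng.normal(0, 1, n)               # ground-truth construct level

# Hypothetical trained-model scores: same construct, but the model under-scores
# group 1 (a data/algorithm bias) and is noisier for that group.
pred = truth - 0.25 * group + rng.normal(0, 0.5 + 0.2 * group, n)

for g in (0, 1):
    m = group == g
    bias = (pred[m] - truth[m]).mean()            # mean predicted-minus-true gap
    rmse = np.sqrt(((pred[m] - truth[m]) ** 2).mean())
    r = np.corrcoef(pred[m], truth[m])[0, 1]      # predictive accuracy
    print(f"group {g}: bias={bias:+.3f}  rmse={rmse:.3f}  r={r:.3f}")
# Nonzero bias gaps and unequal accuracies across groups are the two empirical
# signatures of MLMB described in the abstract.
```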
Affiliation(s)
- Louis Tay: Department of Psychological Sciences, Purdue University, West Lafayette, Indiana
- Sang Eun Woo: Department of Psychological Sciences, Purdue University, West Lafayette, Indiana
- Louis Hickman: The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania
- Brandon M. Booth: Institute of Cognitive Science, University of Colorado Boulder, Boulder, Colorado
- Sidney D’Mello: Institute of Cognitive Science, University of Colorado Boulder, Boulder, Colorado

14. Lutz PK, O'Connor BP, Folk D. Dimensionality, Item Response Theory, Effect Size Attenuation, and Test Bias Analyses of the Self-Importance of Moral Identity Scale (SIMIS). J Pers Assess 2021; 104:586-598. [PMID: 34704515] [DOI: 10.1080/00223891.2021.1991359]
Abstract
The extent to which morality and being a moral person are important to one's identity is most commonly assessed using Aquino and Reed's (2002) Self-Importance of Moral Identity Scale (SIMIS). This study provided detailed psychometric examinations of the structure and discrimination levels of the SIMIS in a large (N = 2108) and heterogeneous sample. Results indicated that the SIMIS is clearly 2-dimensional, as expected. The Internalization and Symbolization subscales provided sufficient, and sometimes high levels of test information across the latent trait continuums. There were no redundant items and no bias based on gender. The most notable, albeit minor, shortcomings were that there are too many response options and that test information (discrimination power) was diminished at high levels of the Internalization latent trait continuum, apparently due to skewness. The fluctuating levels of measurement precision resulted in slightly greater attenuations in effect sizes for Internalization than for Symbolization across data for 31 other measures. The present findings from a large dataset and a variety of modern, revealing statistical methods provided relatively consistent, favorable findings for the measure.
Affiliation(s)
- Brian P O'Connor: Department of Psychology, University of British Columbia, Okanagan

15. Teresi JA, Wang C, Kleinman M, Jones RN, Weiss DJ. Differential Item Functioning Analyses of the Patient-Reported Outcomes Measurement Information System (PROMIS®) Measures: Methods, Challenges, Advances, and Future Directions. Psychometrika 2021; 86:674-711. [PMID: 34251615] [PMCID: PMC8889890] [DOI: 10.1007/s11336-021-09775-0]
Abstract
Several methods used to examine differential item functioning (DIF) in Patient-Reported Outcomes Measurement Information System (PROMIS®) measures are presented, including effect size estimation. A summary of factors that may affect DIF detection and challenges encountered in PROMIS DIF analyses, e.g., anchor item selection, is provided. An issue in PROMIS was the potential for inadequately modeled multidimensionality to result in false DIF detection. Section 1 is a presentation of the unidimensional models used by most PROMIS investigators for DIF detection, as well as their multidimensional expansions. Section 2 is an illustration that builds on previous unidimensional analyses of depression and anxiety short-forms to examine DIF detection using a multidimensional item response theory (MIRT) model. The Item Response Theory-Log-likelihood Ratio Test (IRT-LRT) method was used for a real data illustration with gender as the grouping variable. The IRT-LRT DIF detection method is a flexible approach to handle group differences in trait distributions, known as impact in the DIF literature, and was studied with both real data and in simulations to compare the performance of the IRT-LRT method within the unidimensional IRT (UIRT) and MIRT contexts. Additionally, different effect size measures were compared for the data presented in Section 2. A finding from the real data illustration was that using the IRT-LRT method within a MIRT context resulted in more flagged items as compared to using the IRT-LRT method within a UIRT context. The simulations provided some evidence that while unidimensional and multidimensional approaches were similar in terms of Type I error rates, power for DIF detection was greater for the multidimensional approach. Effect size measures presented in Section 1 and applied in Section 2 varied in terms of estimation methods, choice of density function, methods of equating, and anchor item selection. Despite these differences, there was considerable consistency in results, especially for the items showing the largest values. Future work is needed to examine DIF detection in the context of polytomous, multidimensional data. PROMIS standards included incorporation of effect size measures in determining salient DIF. Integrated methods for examining effect size measures in the context of IRT-based DIF detection procedures are still in early stages of development.
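The IRT-LRT comparison at the heart of the illustration (a model constraining a studied item's parameters to be equal across groups against a model freeing them) can be schematized as below. This is a deliberately simplified logistic stand-in using a rest score as the conditioning variable, not the unidimensional or multidimensional IRT machinery evaluated in the paper; all data and parameter values are simulated.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, k = 3000, 10
group = rng.integers(0, 2, n)
theta = rng.normal(0, 1, n) + 0.3 * group         # group trait difference = impact
b = rng.normal(0, 1, k)
p = 1 / (1 + np.exp(-(theta[:, None] - b)))       # Rasch-like response probabilities
x = (rng.random((n, k)) < p).astype(int)
m = group == 1                                    # inject uniform DIF into item 0
x[m, 0] = (rng.random(m.sum()) <
           1 / (1 + np.exp(-(theta[m] - b[0] - 0.5)))).astype(int)

rest = x[:, 1:].sum(axis=1)        # anchor items: rest score as a crude theta proxy

def loglik(X):
    return sm.Logit(x[:, 0], sm.add_constant(X)).fit(disp=0).llf

ll_constrained = loglik(rest[:, None])                           # parameters equal
ll_free = loglik(np.column_stack([rest, group, rest * group]))   # group-specific
g2 = 2 * (ll_free - ll_constrained)
print(f"G2 = {g2:.2f}, df = 2, p = {stats.chi2.sf(g2, 2):.4g}")  # flags item 0
```

Because the free model absorbs trait-distribution differences through the conditioning variable, the test targets the item, which is why the IRT-LRT approach handles impact gracefully.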
Affiliation(s)
- Jeanne A Teresi: Columbia University Stroud Center, New York, NY, USA; Research Division, Hebrew Home at Riverdale, RiverSpring Health, Bronx, NY, USA; Department of Geriatrics and Palliative Medicine, Weill Cornell Medical Center, New York, NY, USA; New York State Psychiatric Institute, New York, NY, USA
- Chun Wang: Center for Statistics and the Social Sciences (Affiliate), University of Washington College of Education, Seattle, WA, USA
- Richard N Jones: Department of Psychiatry and Human Behavior, Warren Alpert Medical School, Brown University, Providence, RI, USA

16. Lee P, Joo SH, Stark S. Detecting DIF in Multidimensional Forced Choice Measures Using the Thurstonian Item Response Theory Model. Organizational Research Methods 2020. [DOI: 10.1177/1094428120959822]
Abstract
Although modern item response theory (IRT) methods of test construction and scoring have overcome the ipsativity problems historically associated with multidimensional forced choice (MFC) formats, there has been little research on MFC differential item functioning (DIF) detection, where item refers to a block, or group, of statements presented for an examinee’s consideration. This research investigated DIF detection with three-alternative MFC items based on the Thurstonian IRT (TIRT) model, using omnibus Wald tests on loadings and thresholds. We examined constrained and free baseline model comparison strategies with different types and magnitudes of DIF, latent trait correlations, sample sizes, and levels of impact in an extensive Monte Carlo study. Results indicated the free baseline strategy was highly effective in detecting DIF, with power approaching 1.0 in the large sample size and large magnitude of DIF conditions, and similar effectiveness in the impact and no-impact conditions. This research also included an empirical example to demonstrate the viability of the best performing method with real examinees and showed how DIF and DTF effect size measures can be used to assess the practical significance of MFC DIF findings.

17. Dong Y, Dumas D. Are personality measures valid for different populations? A systematic review of measurement invariance across cultures, gender, and age. Personality and Individual Differences 2020. [DOI: 10.1016/j.paid.2020.109956]

18. Lineberry M, Park YS, Hennessy SA, Ritter EM. The Fundamentals of Endoscopic Surgery (FES) skills test: factors associated with first-attempt scores and pass rate. Surg Endosc 2020; 34:3633-3643. [PMID: 32519273] [DOI: 10.1007/s00464-020-07690-6]
Abstract
BACKGROUND: The Fundamentals of Endoscopic Surgery (FES) program became required for American Board of Surgery certification as part of the Flexible Endoscopy Curriculum (FEC) for residents graduating in 2018. This study expands prior psychometric investigation of the FES skills test. METHODS: We analyzed de-identified first-attempt skills test scores and self-reported demographic characteristics of 2023 general surgery residents who were required to pass FES. RESULTS: The overall pass rate was 83%. "Loop Reduction" was the most difficult subtask. Subtasks related to one another only modestly (Spearman's ρ ranging from 0.11 to 0.42; coefficient α = .55). Both upper and lower endoscopic procedural experience had modest positive associations with scores (ρ = 0.14 and 0.15) and passing. Examinees who tested on the GI Mentor Express simulator had lower total scores and a lower pass rate than those tested on the GI Mentor II (pass rates = 73% vs. 85%). Removing an Express-specific scoring rule that had been applied eliminated these differences. Gender, glove size, and height were closely related. Women scored lower than men (408- vs. 489-point averages) and had a lower first-attempt pass rate (71% vs. 92%). Glove size correlated positively with score (ρ = 0.31) and pass rate. Finally, height correlated positively with score (r = 0.27) and pass rate. Statistically controlling for glove size and height did not eliminate gender differences, with men still having 3.2 times greater odds of passing. CONCLUSIONS: FES skills test scores show both consistencies with the assessment's validity argument and several remarkable findings. Subtasks reflect distinct skills, so passing standards should perhaps be set for each subtask. The Express simulator-specific scoring penalty should be removed. The differences seen by gender are concerning. We argue those differences do not reflect measurement bias, but rather highlight equity concerns in surgical technology, training, and practice.
Affiliation(s)
- Matthew Lineberry: Zamierowski Institute for Experiential Learning and Department of Population Health, University of Kansas Medical Center and Health System, 3901 Rainbow Boulevard, Sudler Hall G005, Kansas City, KS 66160, USA
- Yoon Soo Park: Department of Medical Education, University of Illinois at Chicago, Chicago, IL, USA
- Sara A Hennessy: Department of Surgery, UT Southwestern Medical Center, Dallas, TX, USA
- E Matthew Ritter: Division of General Surgery, Department of Surgery, Uniformed Services University/Walter Reed National Military Medical Center, Bethesda, MD, USA

19. Terluin B, van der Wouden JC, de Vet HCW. Measurement equivalence of the Four-Dimensional Symptom Questionnaire (4DSQ) in adolescents and emerging adults. PLoS One 2019; 14:e0221904. [PMID: 31465490] [PMCID: PMC6715201] [DOI: 10.1371/journal.pone.0221904]
Abstract
The Four-Dimensional Symptom Questionnaire (4DSQ) is a self-report instrument measuring distress, depression, anxiety and somatization. The questionnaire has been developed and validated in adult samples. It is unknown whether adolescents and emerging adults respond to the 4DSQ items in the same way as adults do. The objective of the study was to examine measurement equivalence of the 4DSQ across adolescents, emerging adults and adults. 4DSQ data were collected in a primary care psychotherapy practice (N = 1349). Measurement equivalence was assessed using differential item and test functioning (DIF and DTF) analysis in an item response theory framework. DIF was compared across the following groups: adolescents (age 10–17), emerging adults (age 18–25), and adults (age 26–40). DIF was found in 9 items (out of 50) across adolescents and adults, and in 4 items across emerging adults and adults. The item with the largest DIF was ‘difficulty getting to sleep’, which was less severe for adolescents compared to adults. A likely explanation is that adolescents have a high base rate for problems with sleep initiation. The effect of DIF on the scale scores (DTF) was negligible. Adolescents and emerging adults score some 4DSQ items differently compared to adults but this had practically no effect on 4DSQ scale scores. 4DSQ scale scores from adolescents and emerging adults can be interpreted in the same way as 4DSQ scores from adults.
Affiliation(s)
- Berend Terluin: Department of General Practice and Elderly Care Medicine, Amsterdam Public Health research institute, Amsterdam UMC–Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Johannes C. van der Wouden: Department of General Practice and Elderly Care Medicine, Amsterdam Public Health research institute, Amsterdam UMC–Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Henrica C. W. de Vet: Department of Epidemiology and Biostatistics, Amsterdam Public Health research institute, Amsterdam UMC–Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

20. Quantifying the impact of partial measurement invariance in diagnostic research: An application to addiction research. Addict Behav 2019; 94:50-56. [PMID: 30502928] [DOI: 10.1016/j.addbeh.2018.11.029]
Abstract
Establishing measurement invariance, or that an instrument measures the same construct(s) in the same way across subgroups of respondents, is crucial in efforts to validate social and behavioral instruments. Although substantial previous research has focused on detecting the presence of noninvariance, less attention has been devoted to its practical significance and even less has been paid to its possible impact on diagnostic accuracy. In this article, we draw additional attention to the importance of measurement invariance and advance diagnostic research by introducing a novel approach for quantifying the impact of noninvariance with binary items (e.g., the presence or absence of symptoms). We illustrate this approach by testing measurement invariance and evaluating diagnostic accuracy across age groups using DSM alcohol use disorder items from a public national data set. By providing researchers with an easy-to-implement R program for examining diagnostic accuracy with binary items, this article sets the stage for future evaluations of the practical significance of partial invariance. Future work can extend our framework to include ordinal and categorical indicators, other measurement models in item response theory, settings with three or more groups, and via comparison to an external, "gold-standard" validator.
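One way to read "impact on diagnostic accuracy" operationally: score binary symptom items under partial invariance and compare sensitivity and specificity across groups against a known simulated diagnostic status. The sketch below uses invented item parameters, an invented symptom-count rule, and a single noninvariant item; it is not the R program the article supplies.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30000
age_old = rng.integers(0, 2, n)            # 0 = younger, 1 = older group
severity = rng.normal(0, 1, n)             # latent disorder severity
disorder = severity > 0.75                 # hypothetical "true" diagnostic status

# Eleven binary symptom items; one item is "easier" to endorse for the older
# group at the same severity level (partial invariance).
b = np.linspace(-0.5, 1.5, 11)
b_grp = np.tile(b, (n, 1))
b_grp[age_old == 1, 0] -= 0.7
p = 1 / (1 + np.exp(-1.4 * (severity[:, None] - b_grp)))
symptoms = (rng.random(p.shape) < p).astype(int)
diagnosed = symptoms.sum(axis=1) >= 6      # DSM-style symptom-count rule

for g, label in [(0, "younger"), (1, "older")]:
    m = age_old == g
    sens = (diagnosed[m] & disorder[m]).sum() / disorder[m].sum()
    spec = (~diagnosed[m] & ~disorder[m]).sum() / (~disorder[m]).sum()
    print(f"{label}: sensitivity={sens:.3f} specificity={spec:.3f}")
```

Diverging sensitivity or specificity between the groups, despite identical true prevalence mechanics, is the practical cost of noninvariance that the article's framework is designed to quantify.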

21. Nye CD, Joo SH, Zhang B, Stark S. Advancing and Evaluating IRT Model Data Fit Indices in Organizational Research. Organizational Research Methods 2019. [DOI: 10.1177/1094428119833158]
Abstract
Item response theory (IRT) models have a number of advantages for developing and evaluating scales in organizational research. However, these advantages can be obtained only when the IRT model used to estimate the parameters fits the data well. Therefore, examining IRT model fit is important before drawing conclusions from the data. To test model fit, a wide range of indices are available in the IRT literature and have demonstrated utility in past research. Nevertheless, the performance of many of these indices for detecting misfit has not been directly compared in simulations. The current study evaluates a number of these indices to determine their utility for detecting various types of misfit in both dominance and ideal point IRT models. Results indicate that some indices are more effective than others but that none of the indices accurately detected misfit due to multidimensionality in the data. The implications of these results for future organizational research are discussed.
Affiliation(s)
- Bo Zhang: University of Illinois at Urbana-Champaign, Champaign, IL, USA

22. Shi D, Song H, DiStefano C, Maydeu-Olivares A, McDaniel HL, Jiang Z. Evaluating Factorial Invariance: An Interval Estimation Approach Using Bayesian Structural Equation Modeling. Multivariate Behavioral Research 2019; 54:224-245. [PMID: 30569738] [DOI: 10.1080/00273171.2018.1514484]
Abstract
In this study, we introduce an interval estimation approach based on Bayesian structural equation modeling to evaluate factorial invariance. For each tested parameter, the size of noninvariance with an uncertainty interval (i.e., a highest density interval [HDI]) is assessed via Bayesian parameter estimation. By comparing the most credible values (i.e., the 95% HDI) with a region of practical equivalence (ROPE), the Bayesian approach allows researchers to (1) support the null hypothesis of practical invariance, and (2) examine the practical importance of a noninvariant parameter. Compared to the traditional likelihood ratio test, simulation results suggested that the proposed Bayesian approach can offer additional insight into evaluating factorial invariance, thus leading to more informative conclusions. We provide an empirical example to demonstrate the procedures necessary to implement the proposed method in applied research. The importance of and influences on the choice of an appropriate ROPE are discussed.
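The HDI-versus-ROPE decision rule described here is easy to state in code. A minimal sketch, assuming posterior draws of a group difference in a loading are already available; the ±0.1 ROPE bounds and the stand-in normal "posteriors" are arbitrary choices for illustration.

```python
import numpy as np

def hdi(draws, mass=0.95):
    """Narrowest interval containing `mass` of the posterior draws."""
    d = np.sort(np.asarray(draws))
    k = int(np.ceil(mass * len(d)))
    widths = d[k - 1:] - d[:len(d) - k + 1]
    i = int(np.argmin(widths))
    return d[i], d[i + k - 1]

def rope_decision(draws, rope=(-0.1, 0.1)):
    lo, hi = hdi(draws)
    if rope[0] <= lo and hi <= rope[1]:
        return (lo, hi), "practically invariant (HDI inside ROPE)"
    if hi < rope[0] or lo > rope[1]:
        return (lo, hi), "noninvariant (HDI outside ROPE)"
    return (lo, hi), "undecided (HDI overlaps ROPE)"

# Posterior draws of a loading difference between two groups (stand-in values).
rng = np.random.default_rng(11)
for d in (rng.normal(0.02, 0.03, 8000), rng.normal(0.25, 0.05, 8000)):
    (lo, hi), verdict = rope_decision(d)
    print(f"95% HDI = [{lo:+.3f}, {hi:+.3f}] -> {verdict}")
```

The first stand-in posterior supports practical invariance outright, which a point-null likelihood ratio test can never do; that asymmetry is the approach's main selling point.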

23. Sommer M, Arendasy ME, Punter JF, Feldhammer-Kahr M, Rieder A. Do individual differences in test-takers' appraisal of admission testing compromise measurement fairness? Intelligence 2019. [DOI: 10.1016/j.intell.2019.01.006]

24. Herde CN, Lievens F, Solberg EG, Harbaugh JL, Strong MH, Burkholder GJ. Situational Judgment Tests as Measures of 21st Century Skills: Evidence across Europe and Latin America. Revista de Psicología del Trabajo y de las Organizaciones 2019. [DOI: 10.5093/jwop2019a8]

25. Chalmers RP. Model-Based Measures for Detecting and Quantifying Response Bias. Psychometrika 2018; 83:696-732. [PMID: 29907891] [DOI: 10.1007/s11336-018-9626-9]
Abstract
This paper proposes a model-based family of detection and quantification statistics to evaluate response bias in item bundles of any size. Compensatory (CDRF) and non-compensatory (NCDRF) response bias measures are proposed, along with their sample realizations and large-sample variability when models are fitted using multiple-group estimation. Based on the underlying connection to item response theory estimation methodology, it is argued that these new statistics provide a powerful and flexible approach to studying response bias for categorical response data over and above methods that have previously appeared in the literature. To evaluate their practical utility, CDRF and NCDRF are compared to the closely related SIBTEST family of statistics and likelihood-based detection methods through a series of Monte Carlo simulations. Results indicate that the new statistics are more optimal effect size estimates of marginal response bias than the SIBTEST family, are competitive with a selection of likelihood-based methods when studying item-level bias, and are the most optimal when studying differential bundle and test bias.
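In spirit, CDRF integrates the signed difference between the groups' expected bundle scores over the focal-group trait density, while NCDRF integrates the unsigned difference, so opposite-direction DIF cancels in the former but not the latter. A numeric sketch with invented 2PL parameters, not the estimation machinery the paper develops:

```python
import numpy as np

# Hypothetical 2PL parameters for a three-item bundle in two groups.
a = np.array([1.2, 0.9, 1.5])
b_ref = np.array([-0.3, 0.2, 0.8])
b_foc = b_ref + np.array([0.3, 0.0, -0.2])      # mixed-direction DIF

def expected_score(theta, b):
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
    return p.sum(axis=1)                         # bundle "true score" at theta

theta = np.linspace(-4, 4, 401)
f = np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)   # focal-group density, here N(0,1)
diff = expected_score(theta, b_foc) - expected_score(theta, b_ref)

cdrf = np.trapz(diff * f, theta)                 # signed/compensatory: can cancel
ncdrf = np.trapz(np.abs(diff) * f, theta)        # unsigned/non-compensatory
print(f"CDRF  (signed)   = {cdrf:+.4f} score points")
print(f"NCDRF (unsigned) = {ncdrf:.4f} score points")
```

With the mixed-direction shifts above, CDRF sits near zero while NCDRF stays clearly positive, which is exactly the compensatory/non-compensatory distinction the statistics are named for.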
Affiliation(s)
- R Philip Chalmers: Department of Educational Psychology, The University of Georgia, 323 Aderhold Hall, Athens, GA 30602, USA

26. Rome L, Zhang B. Investigating the Effects of Differential Item Functioning on Proficiency Classification. Applied Psychological Measurement 2018; 42:259-274. [PMID: 29881124] [PMCID: PMC5978605] [DOI: 10.1177/0146621617726789]
Abstract
This study provides a comprehensive evaluation of the effects of differential item functioning (DIF) on proficiency classification. Using Monte Carlo simulation, item- and test-level DIF magnitudes were varied systematically to investigate their impact on proficiency classification at multiple decision points. Findings from this study clearly show that the presence of DIF affects proficiency classification not by lowering the overall correct classification rates but by affecting classification error rates differently for reference and focal group members. The study also reveals that multiple items with low levels of DIF can be particularly problematic. They can do similar damage to proficiency classification as high-level DIF items with the same cumulative magnitudes but are much harder to detect with current DIF and differential bundle functioning (DBF) techniques. Finally, how DIF affects proficiency classification errors at multiple cut scores is fully described and discussed.
Affiliation(s)
- Logan Rome: University of Wisconsin–Milwaukee, WI, USA
- Bo Zhang: University of Wisconsin–Milwaukee, WI, USA

27. Nye CD, Bradburn J, Olenick J, Bialko C, Drasgow F. How Big Are My Effects? Examining the Magnitude of Effect Sizes in Studies of Measurement Equivalence. Organizational Research Methods 2018. [DOI: 10.1177/1094428118761122]
Affiliation(s)
- Jacob Bradburn: Department of Psychology, Michigan State University, MI, USA
- Jeffrey Olenick: Department of Psychology, Michigan State University, MI, USA
- Fritz Drasgow: Department of Psychology and School of Labor and Employment Relations, University of Illinois, Champaign, IL, USA

28. Adroher ND, Prodinger B, Fellinghauer CS, Tennant A. All metrics are equal, but some metrics are more equal than others: A systematic search and review on the use of the term 'metric'. PLoS One 2018; 13:e0193861. [PMID: 29509813] [PMCID: PMC5839589] [DOI: 10.1371/journal.pone.0193861]
Abstract
OBJECTIVE: To examine the use of the term 'metric' in the health and social sciences literature, focusing on the interval-scale implication of the term in Modern Test Theory (MTT). MATERIALS AND METHODS: A systematic search and review of MTT studies including 'metric' or 'interval scale' was performed in the health and social sciences literature. The search was restricted to 2001-2005 and 2011-2015. A text mining algorithm was employed to operationalize the eligibility criteria and to explore the uses of 'metric'. The paradigm of each included article (Rasch Measurement Theory (RMT), Item Response Theory (IRT), or both), as well as its type (Theoretical, Methodological, Teaching, Application, Miscellaneous), was determined. An inductive thematic analysis of the first three types was performed. RESULTS: 70.6% of the 1337 included articles were allocated to RMT, and 68.4% were application papers. Among the uses of 'metric', it was predominantly a synonym of 'scale'; as an adjective, it referred to measurement or quantification. Three incompatible themes ('only RMT/all MTT/no MTT models can provide interval measures') were identified, but 'interval scale' was mentioned considerably more often in RMT than in IRT. CONCLUSION: 'Metric' is used in many different ways, and there is no consensus on which MTT metric has interval-scale properties. Nevertheless, when using the term 'metric', authors should specify the level of the metric being used (ordinal, ordered, interval, ratio) and justify why, in their view, the metric is at that level.
Affiliation(s)
- Núria Duran Adroher: Swiss Paraplegic Research, Nottwil, Switzerland; Department of Health Sciences and Health Policy, University of Lucerne, Lucerne, Switzerland
- Birgit Prodinger: Swiss Paraplegic Research, Nottwil, Switzerland; Department of Health Sciences and Health Policy, University of Lucerne, Lucerne, Switzerland; Faculty of Applied Health and Social Sciences, University of Applied Sciences Rosenheim, Rosenheim, Germany
- Carolina Saskia Fellinghauer: Swiss Paraplegic Research, Nottwil, Switzerland; Department of Health Sciences and Health Policy, University of Lucerne, Lucerne, Switzerland
- Alan Tennant: Swiss Paraplegic Research, Nottwil, Switzerland; Department of Health Sciences and Health Policy, University of Lucerne, Lucerne, Switzerland

29. Does the 15-item Geriatric Depression Scale function differently in old people with different levels of cognitive functioning? J Affect Disord 2018; 227:471-476. [PMID: 29156360] [DOI: 10.1016/j.jad.2017.11.045]
Abstract
BACKGROUND: The 15-item version of the Geriatric Depression Scale (GDS-15) is widely employed to screen for depression among the elderly, but little is known about how the scale functions in cognitively impaired individuals compared with normal ones. The aim of the current study was to investigate Differential Item Functioning (DIF) across groups of older people that differ in terms of cognitive functioning, applying Item Response Theory (IRT)-based analyses. METHODS: Data from an Italian multi-centric clinical study on cognitive impairment and dementia in old people were employed (N = 1903; age: M = 77.33, SD = 7.05; 62% women). All participants underwent a comprehensive evaluation (including clinical examination, laboratory screening, neuroimaging, and cognitive and behavioral assessments) and were assigned to three different groups on the basis of their cognitive functioning (normal, mild cognitive impairment, cognitive impairment). RESULTS: Two items showed uniform DIF, but their differential functioning does not propagate to the GDS-15 total scores in such a way that a differential interpretation is needed. LIMITATIONS: Whereas an advantage of the study is the large sample size, the relatively small size of the mild cognitive impairment group might reduce the stability of the present results. CONCLUSIONS: Since a screening tool for the elderly is intended to apply to everyone in the target population, the current findings support the clinical utility of the GDS-15 as a screening tool for depression.

30. Chiesi F, Primi C, Pigliautile M, Baroni M, Ercolani S, Boccardi V, Ruggiero C, Mecocci P. Is the 15-item Geriatric Depression Scale a Fair Screening Tool? A Differential Item Functioning Analysis Across Gender and Age. Psychol Rep 2017; 121:1167-1182. [PMID: 29298589] [DOI: 10.1177/0033294117745561]
Abstract
The 15-item version of the Geriatric Depression Scale (GDS-15) is widely employed to assess depression in old people, but it is unclear whether there are biases in the total score depending on respondents' gender and age. In the current study, we investigated the measurement equivalence of the GDS-15 to provide evidence that the test is a fair screening tool when administered to young-old, old-old, and oldest-old men and women. Item Response Theory-based Differential Item Functioning analyses were applied to a large sample of Italian old people. One item exhibited Differential Item Functioning when comparing men and women, and one item showed Differential Item Functioning across age groups. Nonetheless, the magnitude of Differential Item Functioning was small and did not produce any differential test functioning. The gender and age measurement equivalence of the GDS-15 confirms that the test can be used for clinical and research screening purposes.
Affiliation(s)
- Francesca Chiesi, Department of Neuroscience, Psychology, Drug, and Child's Health (NEUROFARBA), Section of Psychology, University of Florence, Italy
- Caterina Primi, Department of Neuroscience, Psychology, Drug, and Child's Health (NEUROFARBA), Section of Psychology, University of Florence, Italy
- Martina Pigliautile, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
- Marta Baroni, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
- Sara Ercolani, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
- Virginia Boccardi, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
- Carmelinda Ruggiero, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
- Patrizia Mecocci, Department of Medicine, Institute of Gerontology and Geriatrics, University of Perugia, Italy
31
Bowe AG. Moving Toward More Conclusive Measures of Sociocultural Adaptation for Ethnically Diverse Adolescents in England. CANADIAN JOURNAL OF SCHOOL PSYCHOLOGY 2017. [DOI: 10.1177/0829573517739392]
Abstract
This study is part of a larger initiative toward understanding the acculturation of immigrant adolescents using the Longitudinal Study of Young People in England 2004-2010 database. A necessary first step in using a database for cross-ethnic comparisons is to verify whether its items and scales are equivalent. I examined item- and scale-level differential functioning (DF; n = 4,663, six ethnic minority groups) on four of the database's sociocultural scales: Feelings About School (11 items), Relational Family Efficacy (four items), Being Bullied (five items), and Perceived Teacher Discrimination (four items), using an item response theory (IRT)-based framework. First, findings demonstrated no meaningful DF on items and, in most cases, on scales as well. Second, distinct ethnic group patterns were present. Third, the Perceived Teacher Discrimination scale did not function for the majority of the ethnic minority groups, which is of grave concern. Implications for future comparative studies and immigration policy makers are discussed.
32
Foster GC, Min H, Zickar MJ. Review of Item Response Theory Practices in Organizational Research. ORGANIZATIONAL RESEARCH METHODS 2017. [DOI: 10.1177/1094428116689708]
Abstract
In this article, we review recent psychometric practices to determine how item response theory (IRT) has been used in organizational research. We identified and coded 63 articles that used IRT on empirical data published in industrial-organizational and organizational behavior journals since 2000. Results show that typical IRT usage conforms to best practices in several ways; however, in other ways, such as testing for and reporting model fit, there is still significant room for improvement. Next, we surveyed academic and practitioner members of the Society for Industrial-Organizational Psychology (SIOP) on their experiences with and attitudes toward IRT. We conclude that IRT is one area where practice outpaces science. There is a cadre of practitioners who consider IRT essential to their professional life; for others, however, IRT is seen as less relevant. Based on our coding analyses and survey results, we provide suggestions on how to better incorporate IRT into organizational research and practice.
Affiliation(s)
- Hanyi Min, Bowling Green State University, Bowling Green, OH, USA
33
The Four-Dimensional Symptom Questionnaire (4DSQ) in the general population: scale structure, reliability, measurement invariance and normative data: a cross-sectional survey. Health Qual Life Outcomes 2016; 14:130. [PMID: 27629535] [PMCID: PMC5024427] [DOI: 10.1186/s12955-016-0533-4]
Abstract
BACKGROUND The Four-Dimensional Symptom Questionnaire (4DSQ) is a self-report questionnaire measuring distress, depression, anxiety and somatization with separate scales. The 4DSQ has been extensively validated in clinical samples, especially from primary care settings, but information about measurement properties and normative data in the general population was lacking. In a Dutch general population sample we examined the 4DSQ scales' structure, the scales' reliability and measurement invariance with respect to gender, age and education, the scales' score distributions across demographic categories, and normative data. METHODS 4DSQ data were collected in a representative Dutch Internet panel. Confirmatory factor analysis was used to examine the scales' structure. Reliability was examined by Cronbach's alpha and by coefficients omega-total and omega-hierarchical. Differential item functioning (DIF) analysis was used to evaluate measurement invariance across gender, age and education. RESULTS The total response rate was 82.4% (n = 5273/6399). The depression scale proved to be unidimensional. The other scales were best represented as bifactor models consisting of a large general factor and one or more smaller specific factors. The general factors accounted for more than 95% of the reliable variance of the scales. Reliability was high (≥0.85) by all estimates. The distress, depression and anxiety scales were invariant across gender, age and education. The somatization scale demonstrated some lack of measurement invariance as a result of decreased thresholds for some of the items in young people (16-24 years) and increased thresholds in elderly people (65+ years); it was invariant with regard to gender and education. The 4DSQ scores varied significantly across demographic categories, but the explained variance was small (<6%). Normative data were generated for gender and age categories. Approximately 17% of the participants scored above average on the distress scale, whereas 12% scored above average on the somatization scale. The percentages of people scoring high enough on depression or anxiety to suspect the presence of a depressive or anxiety disorder were 4.1% and 2.5%, respectively. CONCLUSIONS Evidence supports the reliability and measurement invariance of the 4DSQ in the general Dutch population. The normative data provided in this study can be used to compare a subject's 4DSQ scores with a general population reference group.
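As an illustration of the omega coefficients used above, the sketch below computes omega-total and omega-hierarchical from the standardized loadings of a fitted bifactor model; the loadings are invented, and the formulas assume orthogonal general and specific factors.

```python
# Sketch: omega-total and omega-hierarchical from standardized bifactor
# loadings (invented values; general and specific factors assumed orthogonal).
import numpy as np

lam_g = np.array([0.70, 0.65, 0.60, 0.70, 0.55, 0.60])   # general factor
lam_s1 = np.array([0.30, 0.35, 0.30, 0.00, 0.00, 0.00])  # specific factor 1
lam_s2 = np.array([0.00, 0.00, 0.00, 0.20, 0.25, 0.30])  # specific factor 2
unique = 1.0 - lam_g**2 - lam_s1**2 - lam_s2**2          # unique variances

common = lam_g.sum()**2 + lam_s1.sum()**2 + lam_s2.sum()**2
var_total = common + unique.sum()
omega_total = common / var_total
omega_h = lam_g.sum()**2 / var_total   # reliable variance due to the general factor

print(f"omega-total = {omega_total:.3f}, omega-hierarchical = {omega_h:.3f}")
```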
34
Nye CD, Sackett PR. New Effect Sizes for Tests of Categorical Moderation and Differential Prediction. ORGANIZATIONAL RESEARCH METHODS 2016. [DOI: 10.1177/1094428116644505]
Abstract
Moderator hypotheses involving categorical variables are prevalent in organizational and psychological research. Despite their importance, current methods of identifying and interpreting these moderation effects have several limitations that may result in misleading conclusions about their implications. This issue has been particularly salient in the literature on differential prediction, where recent research has suggested that these limitations have had a significant impact on past research. To help address these issues, we propose several new effect size indices that provide additional information about categorical moderation analyses. We then illustrate the advantages of these indices in two large databases of respondents by examining categorical moderation in the prediction of psychological well-being and the extent of differential prediction in a large sample of job incumbents.
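A categorical moderation (differential prediction) analysis of the kind the article builds on typically starts from a regression containing a group main effect (intercept difference) and a group-by-predictor interaction (slope difference). The sketch below shows that baseline analysis on simulated data plus a crude standardized effect size; it is not an implementation of the indices proposed in the article.

```python
# Sketch: baseline differential-prediction analysis (intercept and slope
# differences) on simulated data; requires numpy and statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 2, n)                        # hypothetical subgroup flag
x = rng.normal(0.0, 1.0, n)                          # predictor (e.g., test score)
y = 0.5 * x + 0.2 * group + rng.normal(0.0, 1.0, n)  # criterion with intercept gap

X = sm.add_constant(np.column_stack([x, group, x * group]))
fit = sm.OLS(y, X).fit()
print(fit.params)       # [intercept, slope, intercept difference, slope difference]
print(fit.pvalues[2:])  # significance tests for the two difference terms

# Crude standardized effect size: intercept gap in criterion SD units.
d_intercept = fit.params[2] / y.std(ddof=1)
print(f"intercept difference (criterion SD units): {d_intercept:.3f}")
```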
Affiliation(s)
- Christopher D. Nye, Department of Psychology, Michigan State University, East Lansing, MI, USA
- Paul R. Sackett, Department of Psychology, University of Minnesota, Minneapolis, MN, USA
35
Do individual differences in test preparation compromise the measurement fairness of admission tests? INTELLIGENCE 2016. [DOI: 10.1016/j.intell.2016.01.004]
36
Teresi JA, Jones RN. Methodological Issues in Examining Measurement Equivalence in Patient Reported Outcomes Measures: Methods Overview to the Two-Part Series, "Measurement Equivalence of the Patient Reported Outcomes Measurement Information System® (PROMIS®) Short Forms". PSYCHOLOGICAL TEST AND ASSESSMENT MODELING 2016; 58:37-78. [PMID: 28983448] [PMCID: PMC5625814]
Abstract
The purpose of this article is to introduce the methods used and challenges confronted by the authors of this two-part series of articles describing the results of analyses of measurement equivalence of the short form scales from the Patient Reported Outcomes Measurement Information System® (PROMIS®). Qualitative and quantitative approaches used to examine differential item functioning (DIF) are reviewed briefly. Qualitative methods focused on the generation of DIF hypotheses. The basic quantitative approaches all rely on a latent variable model and examine parameters derived either directly from item response theory (IRT) or from structural equation models (SEM). A key methodological focus of these articles is to describe state-of-the-art approaches to the examination of measurement equivalence in eight domains: physical health, pain, fatigue, sleep, depression, anxiety, cognition, and social function. These articles represent the first time that DIF has been examined systematically in the PROMIS short form measures, particularly among ethnically diverse groups. This is also the first set of analyses to examine the performance of PROMIS short forms in patients with cancer. Latent variable model state-of-the-art methods for examining measurement equivalence are introduced briefly in this paper to orient readers to the approaches adopted in this set of papers. Several methodological challenges underlying (DIF-free) anchor item selection and model assumption violations are presented as a backdrop for the articles in this two-part series on measurement equivalence of PROMIS measures.
Affiliation(s)
- Jeanne A. Teresi, Weill Cornell Medical College, Division of Geriatrics and Palliative Medicine; Research Division, Hebrew Home at Riverdale; RiverSpring Health
- Richard N. Jones, Department of Psychiatry and Human Behavior, Department of Neurology, Warren Alpert Medical School, Brown University
37
Kleinman M, Teresi JA. Differential item functioning magnitude and impact measures from item response theory models. PSYCHOLOGICAL TEST AND ASSESSMENT MODELING 2016; 58:79-98. [PMID: 28706769] [PMCID: PMC5505278]
Abstract
Measures of the magnitude and impact of differential item functioning (DIF), at the item and scale level respectively, are presented and reviewed in this paper. Most measures are based on item response theory models. Magnitude refers to item-level effect sizes, whereas impact refers to differences between groups at the scale score level. Reviewed are magnitude measures based on group differences in expected item scores and impact measures based on differences in expected scale scores. The similarities among these indices are demonstrated. Various software packages that provide magnitude and impact measures are described, and new software is presented that computes all of the available statistics conveniently in one program, with explanations of their relationships to one another.
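The distinction between item-level magnitude (differences in expected item scores) and scale-level impact (differences in expected scale scores) can be made concrete in a few lines. The sketch below uses invented 2PL parameters, one uniform and one non-uniform DIF item, and averages over a simulated focal-group theta distribution; it illustrates the general idea rather than the specific statistics the software computes.

```python
# Sketch: item-level DIF magnitude vs. scale-level impact for a 10-item 2PL
# scale; parameters are invented (one uniform, one non-uniform DIF item).
import numpy as np

def p2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

a_ref = np.full(10, 1.3)
b_ref = np.linspace(-1.5, 1.5, 10)
a_foc, b_foc = a_ref.copy(), b_ref.copy()
b_foc[2] += 0.6                      # uniform DIF
a_foc[5] *= 0.7                      # non-uniform DIF

theta_foc = np.random.default_rng(2).normal(-0.2, 1.0, 5000)  # focal thetas
diff = (p2pl(theta_foc[:, None], a_foc, b_foc)
        - p2pl(theta_foc[:, None], a_ref, b_ref))

signed = diff.mean(axis=0)            # signed magnitude per item
unsigned = np.abs(diff).mean(axis=0)  # unsigned magnitude per item
impact = diff.sum(axis=1).mean()      # expected total-score difference

print("signed item magnitudes:", np.round(signed, 3))
print("unsigned item magnitudes:", np.round(unsigned, 3))
print(f"scale-level impact: {impact:.3f} raw-score points")
```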
Affiliation(s)
- Marjorie Kleinman, New York State Psychiatric Institute, 1051 Riverside Drive, Unit 72, New York, NY 10032, USA
- Jeanne A. Teresi, New York State Psychiatric Institute; Columbia University Stroud Center; Research Division, Hebrew Home at Riverdale; RiverSpring Health; Department of Geriatrics and Palliative Medicine, Weill Cornell Medical Center
38
Sommer M, Arendasy ME. Further evidence for the deficit account of the test anxiety–test performance relationship from a high-stakes admission testing setting. INTELLIGENCE 2015. [DOI: 10.1016/j.intell.2015.08.007]
39
Wright NA, Kutschenko K, Bush BA, Hannum KM, Braddy PW. Measurement and Predictive Invariance of a Work-Life Boundary Measure Across Gender. INTERNATIONAL JOURNAL OF SELECTION AND ASSESSMENT 2015. [DOI: 10.1111/ijsa.12102]
Affiliation(s)
- Natalie A. Wright, Department of Psychology & Counseling, Valdosta State University, 1500 N. Patterson St, Valdosta, GA 31698, USA
- Bryant A. Bush, Department of Psychology & Counseling, Valdosta State University, 1500 N. Patterson St, Valdosta, GA 31698, USA
40
Nye CD, Allemand M, Gosling SD, Potter J, Roberts BW. Personality Trait Differences Between Young and Middle-Aged Adults: Measurement Artifacts or Actual Trends? J Pers 2015; 84:473-92. [DOI: 10.1111/jopy.12173]
41
Egberink IJL, Meijer RR, Tendeiro JN. Investigating Measurement Invariance in Computer-Based Personality Testing: The Impact of Using Anchor Items on Effect Size Indices. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 2015; 75:126-145. [PMID: 29795815] [PMCID: PMC5965504] [DOI: 10.1177/0013164414520965]
Abstract
A popular method to assess measurement invariance of a particular item is based on likelihood ratio tests with all other items as anchor items. The results of this method are often reported only in terms of statistical significance, and researchers have proposed different methods to empirically select anchor items. It is unclear, however, how many anchor items should be selected and which method provides the "best" results with empirical data. In the present study, we examined the impact of using different numbers of anchor items on effect size indices when investigating measurement invariance on a personality questionnaire in two different assessment situations. Results suggested that the effect size indices were not influenced by the number of anchor items used: the values were comparable across different numbers of anchor items and were small, which indicates that the effect of differential functioning at the item and test level is very small, if not negligible. We conclude with practical implications for the use of anchor items and effect size indices in practice.
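The anchor-based likelihood-ratio procedure referred to above compares a compact model, in which the studied item's parameters are constrained equal across groups (with the anchors constrained throughout), against an augmented model that frees the studied item. Given the two maximized log-likelihoods from whatever IRT package is in use, the test itself is a one-liner; the log-likelihood values below are placeholders, not real output.

```python
# Sketch: likelihood-ratio DIF test given the maximized log-likelihoods of a
# compact model (studied item constrained) and an augmented model (studied
# item freed). Values are placeholders from a hypothetical fit.
from scipy.stats import chi2

loglik_compact = -10234.7     # studied item's parameters equal across groups
loglik_augmented = -10228.1   # a and b freed for the studied item
n_freed = 2                   # number of parameters freed

lr = 2.0 * (loglik_augmented - loglik_compact)
print(f"LR = {lr:.2f} on {n_freed} df, p = {chi2.sf(lr, n_freed):.4f}")
```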
42
Do BR. Research on Unproctored Internet Testing. INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY-PERSPECTIVES ON SCIENCE AND PRACTICE 2015. [DOI: 10.1111/j.1754-9434.2008.01107.x]
43
Tay L, Meade AW, Cao M. An Overview and Practical Guide to IRT Measurement Equivalence Analysis. ORGANIZATIONAL RESEARCH METHODS 2014. [DOI: 10.1177/1094428114553062]
Abstract
This article provides an overview of and guide to implementing item response theory (IRT) measurement equivalence (ME) or differential item functioning (DIF) analysis. We (a) present the need for establishing IRT ME/DIF analysis, (b) discuss the similarities and differences between factor-analytic and IRT ME/DIF analysis, (c) review commonly used IRT ME/DIF indices and procedures, (d) provide three illustrations of two recommended IRT procedures, and (e) furnish recommendations for conducting IRT ME/DIF analysis. We conclude by discussing future directions for IRT ME/DIF research.
Affiliation(s)
- Louis Tay, Purdue University, West Lafayette, IN, USA
- Adam W Meade, North Carolina State University, Raleigh, NC, USA
- Mengyang Cao, University of Illinois at Urbana-Champaign, Champaign, IL, USA
44
Terluin B, Smits N, Miedema B. The English version of the four-dimensional symptom questionnaire (4DSQ) measures the same as the original Dutch questionnaire: a validation study. Eur J Gen Pract 2014; 20:320-6. [PMID: 24779532] [DOI: 10.3109/13814788.2014.905826]
Abstract
BACKGROUND Translations of questionnaires need to be carefully validated to ensure that the translation measures the same construct(s) as the original questionnaire. The four-dimensional symptom questionnaire (4DSQ) is a Dutch self-report questionnaire measuring distress, depression, anxiety and somatization. OBJECTIVE To evaluate the equivalence of the English version of the 4DSQ. METHODS 4DSQ data from English- and Dutch-speaking general practice attendees were analysed and compared. The English-speaking group consisted of 205 general practice attendees in Canada, aged 18-64 years, whereas the Dutch group consisted of 302 general practice attendees in the Netherlands. Differential item functioning (DIF) analysis was conducted using the Mantel-Haenszel method and ordinal logistic regression. Differential test functioning (DTF; i.e., the scale-level impact of DIF) was evaluated using linear regression analysis. RESULTS DIF was detected in 2/16 distress items, 2/6 depression items, 2/12 anxiety items, and 1/16 somatization items. With respect to mean scale scores, the impact of DIF at the scale level was negligible for all scales. On the anxiety scale, DIF caused English-speaking patients with moderate to severe anxiety to score about one point lower than Dutch patients with the same anxiety level. CONCLUSION The English 4DSQ measures the same constructs as the original Dutch 4DSQ. The distress, depression and somatization scales can employ the same cut-off points as the corresponding Dutch scales. However, the cut-off points of the English 4DSQ anxiety scale should be lowered by one point to retain the same meaning as the Dutch anxiety cut-off points.
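Of the two DIF methods used in this study, the Mantel-Haenszel procedure is the easier to sketch: stratify respondents on a matching score, pool the per-stratum 2x2 odds ratios, and convert to the ETS delta metric. A minimal version for one dichotomous item on simulated data follows; a real analysis would also handle sparse strata and polytomous items.

```python
# Sketch: Mantel-Haenszel DIF for one dichotomous item, stratified on a
# matching score; alpha_MH is the pooled odds ratio, delta_MH the ETS metric.
# All data are simulated.
import numpy as np

rng = np.random.default_rng(3)
n = 800
group = rng.integers(0, 2, n)                     # 0 = reference, 1 = focal
theta = rng.normal(0.0, 1.0, n)
strata = np.clip(np.round(theta * 3 + 8), 0, 15)  # hypothetical matching score
item = (theta - 0.3 * group + rng.logistic(0.0, 1.0, n) > 0).astype(int)

num = den = 0.0
for s in np.unique(strata):                       # one 2x2 table per stratum
    m = strata == s
    a = np.sum((group[m] == 0) & (item[m] == 1))  # reference, endorsed
    b = np.sum((group[m] == 0) & (item[m] == 0))  # reference, not endorsed
    c = np.sum((group[m] == 1) & (item[m] == 1))  # focal, endorsed
    d = np.sum((group[m] == 1) & (item[m] == 0))  # focal, not endorsed
    t = a + b + c + d
    if t > 0:
        num += a * d / t
        den += b * c / t

alpha_mh = num / den
delta_mh = -2.35 * np.log(alpha_mh)               # ETS delta scale
print(f"alpha_MH = {alpha_mh:.3f}, delta_MH = {delta_mh:.3f}")
```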
Affiliation(s)
- Berend Terluin, Department of General Practice and Elderly Care Medicine & EMGO Institute for Health and Care Research, VU University Medical Centre, Amsterdam, the Netherlands
45
Retelsdorf J, Bauer J, Gebauer SK, Kauper T, Möller J. Erfassung berufsbezogener Selbstkonzepte von angehenden Lehrkräften (ERBSE-L) [Assessing the professional self-concepts of prospective teachers]. DIAGNOSTICA 2014. [DOI: 10.1026/0012-1924/a000108]
Abstract
This article presents an instrument for the multidimensional assessment of the professional self-concepts of prospective teachers (ERBSE-L). In a first study with N = 484 preservice teachers, exploratory factor analyses extracted the self-concept dimensions subject matter, innovation, media, diagnostics, education, and counseling. In a second study, this factorial structure was replicated by means of confirmatory factor analyses in a sample of N = 5,802 preservice teachers. In both studies, the six dimensions showed sufficient internal consistencies (α ≥ .71). Moreover, the assumption of measurement invariance of the self-concept dimensions was supported across gender, Gymnasium vs. non-Gymnasium teaching degree, stage of study (beginning vs. advanced students), and over time. Results on expected mean differences between the genders and between the teaching degrees, as well as on associations of the self-concept dimensions with career choice motivation and academic achievement, provide further evidence for the validity of the scales. Overall, the ERBSE-L proved to be a promising instrument for assessing multiple dimensions of prospective teachers' professional self-concept.
46
DuVernet AM, Wright NA, Meade AW, Coughlin C, Kantrowitz TM. General Mental Ability as a Source of Differential Functioning in Personality Scales. ORGANIZATIONAL RESEARCH METHODS 2014. [DOI: 10.1177/1094428114525996]
Abstract
Despite pervasive evidence that general mental ability and personality are unrelated, we investigated whether general mental ability may affect the response process associated with personality measurement. Study 1 examined a large sample of job applicant responses to four personality scales for differential functioning across groups of differing general mental ability. While results indicated that personality items differentially function across highly disparate general mental ability groups, there was little evidence of differential functioning across groups with similar levels of general mental ability. Study 2 replicated these findings in a different sample, using a different measure of general mental ability. We posit that observed differences in the psychometric properties of these personality scales are likely due to the information processing capabilities of the respondents. Additionally, we describe how differential functioning analyses can be used during scale development as a method of identifying items that are not appropriate for all intended respondents. In so doing, we demonstrate procedures for examining other construct-measurement interactions in which respondents’ standings on a specific construct could influence their interpretation of and response to items assessing other constructs.
Affiliation(s)
- Natalie A. Wright, Department of Psychology and Counseling, Valdosta State University, Valdosta, GA, USA
- Adam W. Meade, Department of Psychology, North Carolina State University, Raleigh, NC, USA
47
Janulis P. Improving measurement of injection drug risk behavior using item response theory. THE AMERICAN JOURNAL OF DRUG AND ALCOHOL ABUSE 2013; 40:143-50. [PMID: 24266632] [DOI: 10.3109/00952990.2013.848212]
Abstract
BACKGROUND Recent research highlights the multiple steps involved in preparing and injecting drugs and the resultant viral threats faced by drug users, suggesting that more sensitive measurement of injection drug HIV risk behavior is required. In addition, growing evidence suggests there are gender differences in injection risk behavior; however, the potential for differential item functioning between genders has not been explored. OBJECTIVES To explore item response theory as an improved measurement modeling technique that provides empirically justified scaling of injection risk behavior, and to examine potential gender-based differential item functioning. METHODS Data come from three studies in the National Institute on Drug Abuse's Criminal Justice Drug Abuse Treatment Studies. A two-parameter item response theory model was used to scale injection risk behavior, and logistic regression was used to test for differential item functioning. RESULTS Item fit statistics suggest that item response theory can be used to scale injection risk behavior, and these models can provide more sensitive estimates of risk behavior. Additionally, gender-based differential item functioning is present in the current data. CONCLUSION Improved measurement of injection risk behavior using item response theory should be encouraged, as these models provide increased congruence between construct measurement and the complexity of injection-related HIV risk. Suggestions are made to further improve injection risk behavior measurement. Furthermore, results suggest that direct comparisons of composite scores between males and females may be misleading, and future work should account for differential item functioning before comparing levels of injection risk behavior.
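Scaling risk behavior with a two-parameter model, as done here, weights each behavior by its own discrimination and severity instead of simply counting endorsements. The sketch below computes an EAP (expected a posteriori) scale score for one response pattern under invented 2PL parameters; it illustrates the general scoring idea, not the study's fitted model.

```python
# Sketch: EAP scale score under a 2PL model for a short risk-behavior
# checklist; item discriminations and severities are invented.
import numpy as np

a = np.array([1.8, 1.2, 0.9, 1.5, 1.1])    # discrimination per behavior
b = np.array([-0.5, 0.3, 1.0, 1.6, 2.2])   # severity (location) per behavior
resp = np.array([1, 1, 0, 1, 0])           # one person's endorsed behaviors

quad = np.linspace(-4.0, 4.0, 81)          # quadrature grid over theta
prior = np.exp(-0.5 * quad**2)             # standard normal prior (unnormalized)
p = 1.0 / (1.0 + np.exp(-a[:, None] * (quad[None, :] - b[:, None])))
like = np.prod(np.where(resp[:, None] == 1, p, 1.0 - p), axis=0)

post = like * prior
eap = np.sum(quad * post) / np.sum(post)   # posterior mean of theta
print(f"EAP risk score: {eap:.3f}")
```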
Affiliation(s)
- Patrick Janulis, Department of Psychology, Michigan State University, East Lansing, MI, USA
48
Berry CM, Barratt CL, Dovalina CL, Zhao P. Can racial/ethnic subgroup criterion-to-test standard deviation ratios account for conflicting differential validity and differential prediction evidence for cognitive ability tests? JOURNAL OF OCCUPATIONAL AND ORGANIZATIONAL PSYCHOLOGY 2013. [DOI: 10.1111/joop.12036]
Affiliation(s)
- Clare L. Barratt, Department of Psychology, Texas A&M University, College Station, Texas, USA
- Peng Zhao, Department of Psychology, Texas A&M University, College Station, Texas, USA
49
Scherbaum CA, Sabet J, Kern MJ, Agnello P. Examining faking on personality inventories using unfolding item response theory models. J Pers Assess 2012; 95:207-16. [PMID: 23030769] [DOI: 10.1080/00223891.2012.725439]
Abstract
A concern about personality inventories in diagnostic and decision-making contexts is that individuals will fake. Although there is extensive research on faking, little research has focused on how perceptions of personality items change when individuals are faking or responding honestly. This research demonstrates how the delta parameter from the generalized graded unfolding item response theory model can be used to examine how individuals' perceptions about personality items might change when responding honestly or when faking. The results indicate that perceptions changed from honest to faking conditions for several neuroticism items. The direction of the change varied, indicating that faking can operate to increase or decrease scores within a personality factor.
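The delta parameter examined in this study is the item location in the generalized graded unfolding model (GGUM), where agreement peaks when the respondent's theta is close to delta rather than rising monotonically. Below is a minimal sketch of the GGUM category probabilities; the parameters are invented, and the parameterization follows the commonly cited Roberts, Donoghue, and Laughlin (2000) form, so treat it as an assumption-laden illustration rather than a reference implementation.

```python
# Sketch: GGUM category probabilities for one item; parameters are invented.
import numpy as np

def ggum_probs(theta, alpha, delta, tau):
    """P(Z = z | theta) for z = 0..C, where tau holds thresholds tau_1..tau_C
    (tau_0 = 0 is prepended) and M = 2C + 1 subjective response categories."""
    C = len(tau)
    M = 2 * C + 1
    cum_tau = np.cumsum(np.concatenate([[0.0], tau]))   # sum_{k=0}^{z} tau_k
    z = np.arange(C + 1)
    num = (np.exp(alpha * (z * (theta - delta) - cum_tau))
           + np.exp(alpha * ((M - z) * (theta - delta) - cum_tau)))
    return num / num.sum()

alpha, delta = 1.2, 0.5                    # discrimination and item location
tau = np.array([-1.0, -0.5, -0.2])         # thresholds -> 4 response categories
for theta in (-1.0, 0.5, 2.0):             # agreement peaks near theta = delta
    print(theta, np.round(ggum_probs(theta, alpha, delta, tau), 3))
```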
Affiliation(s)
- Charles A Scherbaum, Department of Psychology, Baruch College, City University of New York, New York, NY 10010, USA
50
Teresi JA, Ramirez M, Jones RN, Choi S, Crane PK. Modifying measures based on differential item functioning (DIF) impact analyses. J Aging Health 2012; 24:1044-76. [PMID: 22422759] [PMCID: PMC4030595] [DOI: 10.1177/0898264312436877]
Abstract
OBJECTIVE Measure modification can affect the comparability of scores across groups and settings; changes in items can affect the percentage of respondents admitting to a symptom. METHODS Using item response theory (IRT) methods, well-calibrated items can be used interchangeably, and the exact same item does not have to be administered to each respondent, theoretically permitting wider latitude for modification. RESULTS Recommendations regarding modifications vary depending on the use of the measure. In the context of research, adjustments can be made at the analytic level by freeing and fixing parameters based on findings of differential item functioning (DIF). The consequences of DIF for clinical decision making depend on whether or not the patient's performance level approaches the scale decision cutpoint. High-stakes testing may require item removal or separate calibrations to ensure accurate assessment. DISCUSSION Guidelines for modification based on DIF analyses and illustrations of the impact of adjustments are presented.