1
Ellis JL, Sijtsma K. Proof of Reliability Convergence to 1 at Rate of Spearman-Brown Formula for Random Test Forms and Irrespective of Item Pool Dimensionality. Psychometrika 2024; 89:774-795. [PMID: 38472632] [DOI: 10.1007/s11336-024-09956-7]
Abstract
It is shown that psychometric test reliability, based on any true-score model with randomly sampled items and uncorrelated errors, converges to 1 as the test length goes to infinity, with probability 1, under some general regularity conditions. The asymptotic rate of convergence is given by the Spearman-Brown formula, and this result does not require the items to be parallel, latent unidimensional, or even finite dimensional. Simulations with the 2-parameter logistic item response theory model reveal that the reliability of short multidimensional tests can be positively biased, meaning that applying the Spearman-Brown formula in these cases would overpredict the reliability that results from lengthening a test. However, constructors of short tests generally aim for tests that measure just one attribute, so the bias problem may have little practical relevance. For short unidimensional tests under the 2-parameter logistic model, reliability is almost unbiased, meaning that applying the Spearman-Brown formula in these cases of greater practical utility leads to predictions that are approximately unbiased.
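The Spearman-Brown prediction referred to in this abstract has a simple closed form, ρ(k) = kρ / (1 + (k − 1)ρ). A minimal sketch of the formula (illustrative only, not code from the paper):

```python
def spearman_brown(rho: float, k: float) -> float:
    """Predicted reliability when a test with reliability rho is
    lengthened by a factor k (k > 1 lengthens, k < 1 shortens).
    As k grows, the prediction converges to 1, matching the
    asymptotic rate described in the abstract."""
    return k * rho / (1 + (k - 1) * rho)
```

For example, doubling a test with reliability .60 predicts a reliability of 2(.60)/(1 + .60) = .75.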
Affiliation(s)
- Jules L Ellis
- Faculty of Psychology, Open University of The Netherlands, Heerlen, The Netherlands.
- Radboud University Nijmegen, Nijmegen, The Netherlands.
2
Delporte M, Molenberghs G, Fieuws S, Verbeke G. A joint normal-ordinal (probit) model for ordinal and continuous longitudinal data. Biostatistics 2024:kxae014. [PMID: 38869057] [DOI: 10.1093/biostatistics/kxae014]
Abstract
In biomedical studies, continuous and ordinal longitudinal variables are frequently encountered. In many of these studies it is of interest to estimate the effect of one of these longitudinal variables on the other. Time-dependent covariates, however, have several limitations; for example, they cannot be included when the data are not collected at fixed intervals. These issues can be circumvented by implementing joint models, in which two or more longitudinal variables are treated as responses and modeled with correlated random effects. By conditioning on these responses, we can then study the effect of one or more longitudinal variables on another. We propose a joint normal-ordinal (probit) model. First, we derive closed-form formulas to estimate the model-based correlations between the responses on their original scale. In addition, we derive the marginal model, in which the interpretation is no longer conditional on the random effects. As a consequence, we can make predictions for a subvector of one response conditional on the other response and, potentially, on a subvector of the history of the response. Next, we extend the approach to a high-dimensional case with more than two ordinal and/or continuous longitudinal variables. The methodology is applied to a case study in which, among other things, a longitudinal ordinal response is predicted with a longitudinal continuous variable.
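The shared-random-effect construction described here can be illustrated with a small simulation: a continuous response and an ordinal response generated from correlated subject-level intercepts, with the ordinal variable obtained by cutting a latent probit scale at fixed thresholds. All parameter values below are invented for illustration; this is a sketch of the general idea, not the authors' model code:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subj, n_obs = 200, 5

# Correlated random intercepts link the two longitudinal responses
cov = np.array([[1.0, 0.6],
                [0.6, 1.0]])
b = rng.multivariate_normal([0.0, 0.0], cov, size=n_subj)

# Continuous response: subject intercept plus residual noise
y_cont = b[:, [0]] + rng.normal(0.0, 0.5, size=(n_subj, n_obs))

# Ordinal response: latent probit scale cut at fixed thresholds
latent = b[:, [1]] + rng.normal(0.0, 1.0, size=(n_subj, n_obs))
thresholds = [-0.5, 0.5]
y_ord = np.digitize(latent, thresholds)  # categories 0, 1, 2
```

Fitting such a model then amounts to estimating the covariance of the random effects and the thresholds jointly, which is what allows one response to be predicted conditionally on the other.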
Affiliation(s)
- Margaux Delporte
- Department of Public Health & Primary Care, Leuven Biostatistics and Statistical Bioinformatics Centre, Kapucijnenvoer 7 - box 7001, 3000 Leuven, Belgium
- Geert Molenberghs
- Department of Public Health & Primary Care, Leuven Biostatistics and Statistical Bioinformatics Centre, Kapucijnenvoer 7 - box 7001, 3000 Leuven, Belgium
- Data Science Institute, Interuniversity Biostatistics and Statistical Bioinformatics Centre, Agoralaan Gebouw D-B -3590 Diepenbeek, Belgium
- Steffen Fieuws
- Department of Public Health & Primary Care, Leuven Biostatistics and Statistical Bioinformatics Centre, Kapucijnenvoer 7 - box 7001, 3000 Leuven, Belgium
- Geert Verbeke
- Department of Public Health & Primary Care, Leuven Biostatistics and Statistical Bioinformatics Centre, Kapucijnenvoer 7 - box 7001, 3000 Leuven, Belgium
- Data Science Institute, Interuniversity Biostatistics and Statistical Bioinformatics Centre, Agoralaan Gebouw D-B -3590 Diepenbeek, Belgium
3
Metsämuuronen J. Typology of Deflation-Corrected Estimators of Reliability. Front Psychol 2022; 13:891959. [PMID: 35923730] [PMCID: PMC9341485] [DOI: 10.3389/fpsyg.2022.891959]
Abstract
The reliability of a test score is discussed from the viewpoint of underestimation of, and specifically deflation in, estimates of reliability. Many widely used estimators are known to underestimate reliability. Empirical cases have shown that estimates by widely used estimators such as alpha, theta, omega, and rho may be deflated by up to 0.60 units of reliability, or even more, with certain types of datasets. The reason for this radical deflation lies in the item-score correlation (Rit) embedded in the estimators: because estimates by Rit are deflated when the numbers of categories of the two scales differ widely, as is always the case for an item and a score, the estimates of reliability are deflated as well. A short-cut method for reaching estimates closer to the true magnitude, and a new type of estimator, the deflation-corrected estimator of reliability (DCER), are studied in the article. The empirical section studies the characteristics of combinations of DCERs formed by different base estimators (alpha, theta, omega, and rho), different alternative estimators of correlation as the linking factor between the item and the score variable, and different conditions. Based on the simulation, an initial typology of the families of DCERs is presented: some estimators perform better with binary items and some with polytomous items; some perform better with small sample sizes and some with larger ones.
Affiliation(s)
- Jari Metsämuuronen
- Finnish Education Evaluation Centre, Helsinki, Finland
- Centre for Learning Analytics, University of Turku, Turku, Finland
4
Metsämuuronen J. Deflation-Corrected Estimators of Reliability. Front Psychol 2022; 12:748672. [PMID: 35069327] [PMCID: PMC8781775] [DOI: 10.3389/fpsyg.2021.748672]
Abstract
Underestimation of reliability is discussed from the viewpoint of deflation in estimates of reliability caused by artificial, systematic, technical or mechanical error in the estimates of correlation (MEC). Most traditional estimators of reliability embed the product-moment correlation coefficient (PMC) in the form of the item-score correlation (Rit) or a principal component or factor loading (λi). PMC is known to be severely affected by several sources of deflation, such as the difficulty level of the item and the discrepancy between the scales of the variables of interest; hence, the estimates by Rit and λi are always deflated in settings related to estimating reliability. As a short-cut to deflation-corrected estimators of reliability, this article suggests a procedure in which Rit and λi in the estimators of reliability are replaced by alternative estimators of correlation that are less deflated. These estimators are called deflation-corrected estimators of reliability (DCERs). Several families of DCERs are proposed, and their behavior is studied using the polychoric correlation coefficient, Goodman-Kruskal gamma, and Somers delta as examples of MEC-corrected coefficients of correlation.
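As an illustration of the short-cut idea (replacing the deflated product-moment item-score correlation with a less deflated ordinal association), the following sketch computes coefficient alpha and Goodman-Kruskal gamma; a DCER would then substitute gamma (or a polychoric correlation) for Rit inside the reliability formula, as the abstract describes. Function names and the O(n²) pair-counting approach are ours, not the author's:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha from an n-persons x k-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over concordant/discordant pairs.
    Tied pairs are ignored, which is what makes gamma less sensitive
    to the discrepancy in category counts between an item and a score."""
    x, y = np.asarray(x), np.asarray(y)
    conc = disc = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (conc + disc)
```

Comparing gamma between each item and the total score against the corresponding Rit values makes the deflation visible on real binary or polytomous data.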
Affiliation(s)
- Jari Metsämuuronen
- Finnish National Education Evaluation Centre (FINEEC), Helsinki, Finland
5
Cho SJ, Shen J, Naveiras M. Multilevel Reliability Measures of Latent Scores Within an Item Response Theory Framework. Multivariate Behavioral Research 2019; 54:856-881. [PMID: 31215245] [DOI: 10.1080/00273171.2019.1596780]
Abstract
This paper evaluated multilevel reliability measures in two-level nested designs (e.g., students nested within teachers) within an item response theory framework. A simulation study investigated the behavior of the multilevel reliability measures, and the uncertainty associated with them, in multilevel designs that varied the number of clusters, cluster sizes, and intraclass correlations (ICCs), and in different test lengths, for two parameterizations of multilevel item response models with separate item discriminations or the same item discrimination across levels. Marginal maximum likelihood estimation (MMLE) with multiple imputation and Bayesian analysis were employed to evaluate the accuracy of the multilevel reliability measures and the empirical coverage rates of Monte Carlo (MC) confidence or credible intervals. Considering the accuracy of the multilevel reliability measures and the empirical coverage rate of the intervals, the results lead us to generally recommend MMLE-multiple imputation. In the model with separate item discriminations across levels, marginally acceptable accuracy of the multilevel reliability measures and empirical coverage of the MC confidence intervals were found only in a limited condition (200 clusters, cluster size of 30, ICC of .2, and 40 items) under MMLE-multiple imputation. In the model with the same item discrimination across levels, the accuracy of the multilevel reliability measures and the empirical coverage rate of the MC confidence intervals were acceptable in all multilevel designs we considered with 40 items under MMLE-multiple imputation. We discuss these findings and provide guidelines for reporting multilevel reliability measures.
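The design quantities mentioned in this abstract (e.g., 200 clusters, cluster size 30, ICC of .2) can be made concrete with a small two-level simulation and the one-way ANOVA estimator of the intraclass correlation. This is a generic illustration of the nested-design quantities only, not the paper's IRT-based reliability measures:

```python
import numpy as np

rng = np.random.default_rng(7)
n_clusters, n_per = 200, 30
icc_true = 0.2

# Variance components chosen so that between / (between + within) = 0.2
between = rng.normal(0.0, np.sqrt(icc_true), size=(n_clusters, 1))
within = rng.normal(0.0, np.sqrt(1 - icc_true), size=(n_clusters, n_per))
scores = between + within

# One-way ANOVA (ICC(1)) estimator
cluster_means = scores.mean(axis=1)
msb = n_per * cluster_means.var(ddof=1)
msw = ((scores - cluster_means[:, None]) ** 2).sum() / (n_clusters * (n_per - 1))
icc_hat = (msb - msw) / (msb + (n_per - 1) * msw)
```

With 200 clusters of size 30, the estimate lands close to the true ICC of .2, which is the kind of design cell the simulation study evaluates.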
6
van den Heuvel ER, Griffith LE, Sohel N, Fortier I, Muniz-Terrera G, Raina P. Latent variable models for harmonization of test scores: A case study on memory. Biom J 2019; 62:34-52. [PMID: 31583767] [DOI: 10.1002/bimj.201800146]
Abstract
Combining data from different studies has a long tradition within the scientific community. It requires that the same information is collected from each study to be able to pool individual data. When studies have implemented different methods or used different instruments (e.g., questionnaires) to measure the same characteristics or constructs, the observed variables need to be harmonized in some way to obtain equivalent content information across studies. This paper formulates the main concepts for harmonizing test scores from different observational studies in terms of latent variable models. The concepts are formulated in terms of calibration, invariance, and exchangeability. Although similar ideas are present in measurement reliability and test equating, harmonization is different from measurement invariance and generalizes test equating. In addition, if a test score needs to be transformed into another test score, harmonization of variables is possible only under specific conditions. Observed test scores that connect all of the different studies are necessary to test the underlying assumptions of harmonization. The concepts of harmonization are illustrated on multiple memory test scores from three different Canadian studies.
Affiliation(s)
- Edwin R van den Heuvel
- Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
- Lauren E Griffith
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Nazmul Sohel
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Isabel Fortier
- Research Institute - McGill University Health Centre, Montreal, Quebec, Canada
- Parminder Raina
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
7
Dumas D, Dong Y. Development and calibration of the student opportunities for deeper learning instrument. Psychology in the Schools 2019. [DOI: 10.1002/pits.22292]
Affiliation(s)
- Denis Dumas
- Department of Research Methods and Information Science, University of Denver, Denver, Colorado
- Yixiao Dong
- Department of Research Methods and Information Science, University of Denver, Denver, Colorado
8
Smits N, van der Ark LA, Conijn JM. Measurement versus prediction in the construction of patient-reported outcome questionnaires: can we have our cake and eat it? Qual Life Res 2018; 27:1673-1682. [PMID: 29098607] [PMCID: PMC5997739] [DOI: 10.1007/s11136-017-1720-4]
Abstract
BACKGROUND Two important goals when using questionnaires are (a) measurement: the questionnaire is constructed to assign numerical values that accurately represent the test taker's attribute, and (b) prediction: the questionnaire is constructed to give an accurate forecast of an external criterion. Construction methods aimed at measurement prescribe that items should be reliable; in practice, this leads to questionnaires with high inter-item correlations. By contrast, construction methods aimed at prediction typically prescribe that items have a high correlation with the criterion and low inter-item correlations. The latter approach has often been said to produce a paradox concerning the relation between reliability and validity [1-3], because it is often assumed that good measurement is a prerequisite of good prediction. OBJECTIVE To answer four questions: (1) Why are measurement-based methods suboptimal for questionnaires that are used for prediction? (2) How should one construct a questionnaire that is used for prediction? (3) Do questionnaire-construction methods that optimize measurement and prediction lead to the selection of different items? (4) Is it possible to construct a questionnaire that can be used for both measurement and prediction? ILLUSTRATIVE EXAMPLE An empirical data set consisting of scores of 242 respondents on questionnaire items measuring mental health is used to select items by means of two methods: a method that optimizes the predictive value of the scale (i.e., forecasting a clinical diagnosis) and a method that optimizes the reliability of the scale. We show that different sets of items are selected for the two scales and that a scale constructed to meet one goal does not perform optimally with respect to the other. DISCUSSION The answers are as follows: (1) Because measurement-based methods tend to maximize inter-item correlations, which reduces predictive validity. (2) By selecting items that correlate strongly with the criterion and weakly with the remaining items. (3) Yes, these methods may lead to different item selections. (4) For a single questionnaire: yes, but it is problematic because reliability cannot be estimated accurately. For a test battery: yes, but it is very costly. Implications for the construction of patient-reported outcome questionnaires are discussed.
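The prediction-oriented construction rule in answer (2), selecting items that correlate strongly with the criterion and weakly with each other, can be sketched as a greedy selection. The scoring rule below (criterion correlation minus mean absolute correlation with already-selected items) is our simplification for illustration, not the authors' procedure:

```python
import numpy as np

def select_for_prediction(items, criterion, n_select):
    """Greedily pick items with high criterion correlation, penalized
    by their mean absolute correlation with already-selected items."""
    items = np.asarray(items, dtype=float)
    chosen = []
    remaining = list(range(items.shape[1]))
    while len(chosen) < n_select and remaining:
        def score(j):
            r_crit = abs(np.corrcoef(items[:, j], criterion)[0, 1])
            if not chosen:
                return r_crit
            r_items = np.mean([abs(np.corrcoef(items[:, j], items[:, c])[0, 1])
                               for c in chosen])
            return r_crit - r_items
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

A reliability-oriented method would instead favor items with high inter-item correlations, which is exactly why the two methods tend to select different item sets.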
Affiliation(s)
- Niels Smits
- Research Institute of Child Development and Education, University of Amsterdam, Nieuwe Achtergracht 127, 1018 WS, Amsterdam, The Netherlands.
- L Andries van der Ark
- Research Institute of Child Development and Education, University of Amsterdam, Nieuwe Achtergracht 127, 1018 WS, Amsterdam, The Netherlands
- Judith M Conijn
- Research Institute of Child Development and Education, University of Amsterdam, Nieuwe Achtergracht 127, 1018 WS, Amsterdam, The Netherlands
9
Grygiel P, Humenny G, Rębisz S. Using the De Jong Gierveld Loneliness Scale With Early Adolescents: Factor Structure, Reliability, Stability, and External Validity. Assessment 2016; 26:151-165. [PMID: 27932403] [DOI: 10.1177/1073191116682298]
Abstract
The present investigation is the first examination of the factor structure, reliability, external validity, longitudinal invariance, and stability of the De Jong Gierveld Loneliness Scale (DJGLS) as used with early adolescents. It is based on a two-wave, large, representative sample of Polish primary school pupils. The results demonstrate that the model most reflective of the factor structure of the DJGLS is the bifactor model, which assumes one highly reliable general factor (overall sense of loneliness) and two relatively irrelevant subfactors. Essential unidimensionality (the general factor accounting for three fourths of the common variance) suggests that interpreting the subfactors over and above the general factor is inappropriate. The longitudinal confirmatory factor analysis indicated that the bifactor structure of the DJGLS is invariant over time. Correlations with self-rated loneliness, sociometric acceptance/rejection, social self-efficacy, identification with the class group, family structure, and gender support the validity of the DJGLS. This implies that it could be used as a measure of loneliness in adolescence that does not involve references to the school context, making it possible to conduct studies that go beyond the school period and to compare the intensity of loneliness in this group with other age groups.
Affiliation(s)
- Paweł Grygiel
- The Educational Research Institute, Warsaw, Poland
10
Cho SJ, Goodwin AP. Modeling Learning in Doubly Multilevel Binary Longitudinal Data Using Generalized Linear Mixed Models: An Application to Measuring and Explaining Word Learning. Psychometrika 2016; 82. [PMID: 27038452] [DOI: 10.1007/s11336-016-9496-y]
Abstract
When word learning is supported by instruction in experimental studies with adolescents, word knowledge outcomes tend to have a complex data structure, involving multiple aspects of word knowledge, multilevel reader data, multilevel item data, a longitudinal design, and multiple groups. This study illustrates how generalized linear mixed models can be used to measure and explain word learning for data with such complexity. Results from this application provide a deeper understanding of word knowledge than could be attained from simpler models and show that word knowledge is multidimensional and depends on word characteristics and instructional contexts.
Affiliation(s)
- Sun-Joo Cho
- Vanderbilt University's Peabody College, Nashville, TN, USA.