51
Zou G, Klar N. A Non-iterative Confidence Interval Estimating Procedure for the Intraclass Kappa Statistic with Multinomial Outcomes. Biom J 2005; 47:682-90. [PMID: 16385909 DOI: 10.1002/bimj.200310154]
Abstract
We obtain the asymptotic sample variance of the intraclass kappa statistic for multinomial outcome data. A modified Wald type procedure based on this theory is then used for confidence interval construction. The results of a simulation study show that the proposed non-iterative approach performs very well in terms of confidence interval coverage and width for samples as small as 50. The procedure is illustrated with two examples from previously published medical studies.
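As a rough illustration of the statistic being studied, the sketch below computes the intraclass kappa for paired multinomial ratings, with chance agreement taken from category proportions pooled over both rating occasions. The paper's closed-form Wald-type variance is not reproduced; the bootstrap interval and the toy data are stand-ins.

```python
# Sketch of the intraclass kappa for paired multinomial ratings, with a
# bootstrap interval standing in for the paper's closed-form Wald-type
# variance (not reproduced here). Illustrative only.
import numpy as np

def intraclass_kappa(pairs, n_categories):
    """pairs: (n, 2) integer array of category codes 0..K-1, two ratings per subject."""
    pairs = np.asarray(pairs)
    n = len(pairs)
    # Pooled category proportions across both rating occasions
    p = np.bincount(pairs.ravel(), minlength=n_categories) / (2 * n)
    p_obs = np.mean(pairs[:, 0] == pairs[:, 1])   # observed agreement
    p_exp = np.sum(p ** 2)                        # chance agreement under pooled margins
    return (p_obs - p_exp) / (1 - p_exp)

def bootstrap_ci(pairs, n_categories, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    pairs = np.asarray(pairs)
    stats = [intraclass_kappa(pairs[rng.integers(0, len(pairs), len(pairs))], n_categories)
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Toy example: 3 categories, 10 subjects rated twice
data = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 1), (2, 2), (0, 0), (1, 2), (2, 2), (0, 0)]
print(intraclass_kappa(data, 3), bootstrap_ci(data, 3))
```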
Affiliation(s)
- Guangyong Zou
- Robarts Clinical Trials, Robarts Research Institute, 100 Perth Drive, London, Ontario, Canada N6A 5K8.
52
Reid GJ, McGrath PJ, Lang BA. Parent-child interactions among children with juvenile fibromyalgia, arthritis, and healthy controls. Pain 2005; 113:201-10. [PMID: 15621381 DOI: 10.1016/j.pain.2004.10.018]
Abstract
Parent-child interactions during pain-inducing exercise tasks among children (11-17 years old) with fibromyalgia, juvenile rheumatoid arthritis, and pain-free controls were examined, and the contribution of parent-child interactions to disability was tested. Fifteen children in each of the three diagnostic groups and their parents completed 5-min exercise tasks and questionnaire measures of disability (Functional Disability Inventory) and coping (Pain Coping Questionnaire). There were few group differences in parent-child interactions. After controlling for children's ratings of pain evoked by the exercise, group differences in interactions during exercise tasks were no longer significant. Sequential analyses, controlling for group and exercise task, revealed that when parents made statements discouraging coping following children's negative verbalizations about the task or pain, children were less likely to be on task, compared to when parents made statements encouraging coping or any other statements. Children's general pain coping strategies were not related to parent-child interactions. Parent-child interactions were generally not related to disability. Across the groups, more pain and less time on task during the exercises were related to Functional Disability Inventory scores and more school absences. Parent-child interaction patterns influence children's adaptation to pain during experimental tasks. Parents' discouragement of coping in response to their children's negative statements about the pain or the pain-evoking task is counterproductive to children's ability to maintain activity in a mildly painful situation.
Affiliation(s)
- Graham J Reid
- Psychology, IWK Health Centre and Dalhousie University, Halifax, NS, Canada.
53
Dunn WR, Wolf BR, Amendola A, Andrish JT, Kaeding C, Marx RG, McCarty EC, Parker RD, Wright RW, Spindler KP. Multirater agreement of arthroscopic meniscal lesions. Am J Sports Med 2004; 32:1937-40. [PMID: 15572324 DOI: 10.1177/0363546504264586]
Abstract
BACKGROUND: Establishing the validity of classification schemes is a crucial preparatory step that should precede multicenter studies. There are no studies investigating the reproducibility of arthroscopic classification of meniscal pathology among multiple surgeons at different institutions. HYPOTHESIS: Arthroscopic classification of meniscal pathology is reliable and reproducible and suitable for multicenter studies that involve multiple surgeons. STUDY DESIGN: Multirater agreement study. METHODS: Seven surgeons reviewed a video of 18 meniscal tears and completed a meniscal classification questionnaire. Multirater agreement was calculated based on the proportion of agreement, the kappa coefficient, and the intraclass correlation coefficient. RESULTS: There was a 46% agreement on the central/peripheral location of tears (kappa = 0.30), an 80% agreement on the depth of tears (kappa = 0.46), a 72% agreement on the presence of a degenerative component (kappa = 0.44), a 71% agreement on whether lateral tears were central to the popliteal hiatus (kappa = 0.42), a 73% agreement on the type of tear (kappa = 0.63), an 87% agreement on the location of the tear (kappa = 0.61), and an 84% agreement on the treatment of tears (kappa = 0.66). There was considerable agreement among surgeons on length, with an intraclass correlation coefficient of 0.78, 95% confidence interval of 0.57 to 0.92, and P < .001. CONCLUSIONS: Arthroscopic grading of meniscal pathology is reliable and reproducible. CLINICAL RELEVANCE: Surgeons can reliably classify meniscal pathology and agree on treatment, which is important for multicenter trials.
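The abstract does not say which multirater kappa was used, so the sketch below illustrates one common choice, Fleiss' kappa, together with the raw pairwise proportion of agreement; the 5-lesion table is invented for illustration.

```python
# Illustrative multirater agreement on a categorical item: raw pairwise
# agreement plus Fleiss' kappa. The paper's exact kappa variant is not
# specified in the abstract, so Fleiss' kappa is used here as a common choice.
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_subjects, n_categories); counts[i, j] = raters placing subject i in category j."""
    counts = np.asarray(counts, dtype=float)
    n_sub, _ = counts.shape
    n_rat = counts[0].sum()                        # raters per subject (assumed constant)
    p_j = counts.sum(axis=0) / (n_sub * n_rat)     # overall category proportions
    p_i = (np.sum(counts ** 2, axis=1) - n_rat) / (n_rat * (n_rat - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 5 lesions, 7 raters, 3 tear types (rows sum to 7)
counts = np.array([[7, 0, 0], [5, 2, 0], [0, 6, 1], [1, 1, 5], [0, 0, 7]])
print("pairwise agreement:", np.mean((np.sum(counts ** 2, axis=1) - 7) / (7 * 6)))
print("Fleiss kappa:", fleiss_kappa(counts))
```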
Affiliation(s)
- Warren R Dunn
- Hospital for Special Surgery, New York, New York, USA.
54
55
Lester Kirchner H, Lemke JH. Simultaneous estimation of intrarater and interrater agreement for multiple raters under order restrictions for a binary trait. Stat Med 2002; 21:1761-72. [PMID: 12111910 DOI: 10.1002/sim.1138]
Abstract
It is valuable in many studies to assess both intrarater and interrater agreement. Most measures of intrarater agreement do not adjust for unequal estimates of prevalence between the separate rating occasions for a given rater, and measures of interrater agreement typically ignore data from the second set of assessments when raters make duplicate assessments. When both measures are assessed, there are instances in which interrater agreement is larger than at least one of the corresponding intrarater agreements, implying that a rater agrees less with him- or herself than with another rater. For the situation of multiple raters making duplicate assessments on all subjects, the authors propose properties for an agreement measure based on the odds ratio for a dichotomous trait: (i) estimate a single prevalence across the two reading occasions for each rater; (ii) estimate pairwise interrater agreement from all available data; (iii) bound the pairwise interrater agreement above by the corresponding intrarater agreements. Estimation of odds ratios under these properties is done by maximizing the multinomial likelihood with constraints, using generalized log-linear models in combination with a generalization of the Lemke-Dykstra iterative-incremental algorithm. An example from a mammography examination reliability study is used to demonstrate the new method.
Affiliation(s)
- H Lester Kirchner
- Department of Pediatrics, Rainbow Babies and Children's Hospital, Case Western Reserve University, Cleveland, OH 44106-6003, USA.
56
Weinfurt KP, Trucco SM, Willke RJ, Schulman KA. Measuring agreement between patient and proxy responses to multidimensional health-related quality-of-life measures in clinical trials. An application of psychometric profile analysis. J Clin Epidemiol 2002; 55:608-18. [PMID: 12063103 DOI: 10.1016/s0895-4356(02)00392-x]
Abstract
When patients cannot provide responses to health-related quality-of-life (HRQOL) measures in clinical trials, family or friends may be asked to respond. We present a simple, comprehensive method for assessing agreement between patients with head injury and their proxy responders. In contrast to more traditional approaches, this method defines agreement separately for each patient-proxy pair, and compares HRQOL profiles along three dimensions: level, or the average of the ratings; scatter, or the variability in the ratings; and shape, or the ranks of the ratings. We demonstrate this method in the context of a clinical trial of a treatment for traumatic head injury and compare the results to those obtained using traditional analyses. Options for incorporating proxy responses into clinical trial analyses are discussed.
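A minimal sketch of the three profile components named above, computed for a hypothetical patient-proxy pair of HRQOL subscale scores; the paper's exact statistics and decision rules are not reproduced, and Spearman's rho is used here only as a convenient stand-in for shape agreement.

```python
# Minimal sketch of profile comparison along the three dimensions named in the
# abstract: level (average rating), scatter (spread around the level), and
# shape (ordering of the subscales). Exact statistics in the paper may differ.
import numpy as np
from scipy.stats import spearmanr

def profile_components(scores):
    scores = np.asarray(scores, dtype=float)
    level = scores.mean()
    scatter = scores.std(ddof=1)          # variability of ratings around the level
    return level, scatter

# Hypothetical HRQOL subscale scores for one patient-proxy pair
patient = [70, 55, 80, 60, 65]
proxy   = [60, 50, 85, 55, 70]
for name, s in (("patient", patient), ("proxy", proxy)):
    print(name, "level=%.1f scatter=%.1f" % profile_components(s))
rho, _ = spearmanr(patient, proxy)        # shape agreement: rank-order similarity
print("shape agreement (Spearman rho): %.2f" % rho)
```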
Affiliation(s)
- Kevin P Weinfurt
- Center for Clinical and Genetic Economics, Duke Clinical Research Institute, Duke University Medical Center, Durham, NC 27715, USA.
57
Meyer GJ, Hilsenroth MJ, Baxter D, Exner JE, Fowler JC, Piers CC, Resnick J. An examination of interrater reliability for scoring the Rorschach Comprehensive System in eight data sets. J Pers Assess 2002; 78:219-74. [PMID: 12067192 DOI: 10.1207/s15327752jpa7802_03]
Abstract
In this article, we describe interrater reliability for the Comprehensive System (CS; Exner, 1993) in 8 relatively large samples, including (a) students, (b) experienced researchers, (c) clinicians, (d) clinicians and then researchers, (e) a composite clinical sample (i.e., a to d), and 3 samples in which randomly generated erroneous scores were substituted for (f) 10%, (g) 20%, or (h) 30% of the original responses. Across samples, 133 to 143 statistically stable CS scores had excellent reliability, with median intraclass correlations of .85, .96, .97, .95, .93, .95, .89, and .82, respectively. We also demonstrate that reliability findings from this study closely match the results derived from a synthesis of prior research, that CS summary scores are more reliable than scores assigned to individual responses, that small samples are more likely to generate unstable and lower reliability estimates, and that Meyer's (1997a) procedures for estimating response segment reliability were accurate. The CS can be scored reliably, but because scoring depends on coder skill, clinicians must conscientiously monitor their accuracy.
Affiliation(s)
- Gregory J Meyer
- Department of Psychology, University of Alaska, Anchorage 99508, USA.
58
Abstract
Model-based inference procedures for the kappa statistic have developed rapidly over the last decade. However, no method has yet been developed for constructing a confidence interval about a difference between independent kappa statistics that is valid in samples of small to moderate size. In this article, we propose and evaluate two such methods based on an idea proposed by Newcombe (1998, Statistics in Medicine, 17, 873-890) for constructing a confidence interval for a difference between independent proportions. The methods are shown to provide very satisfactory results in sample sizes as small as 25 subjects per group. Sample size requirements that achieve a prespecified expected width for a confidence interval about a difference between kappa statistics are also presented.
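The Newcombe-style "square-and-add" construction referred to above can be sketched as follows; the per-kappa confidence limits are assumed to come from some valid method, which is the part of the procedure the paper actually develops and which is not reproduced here.

```python
# Sketch of a Newcombe-style ("square-and-add") interval for a difference
# between two independent kappa statistics: combine separately obtained
# confidence limits for each kappa. How those per-kappa limits are
# constructed (the crux of the paper) is assumed, not reproduced.
import math

def difference_ci(k1, l1, u1, k2, l2, u2):
    """(k, l, u) = point estimate and lower/upper confidence limits for each kappa."""
    d = k1 - k2
    lower = d - math.sqrt((k1 - l1) ** 2 + (u2 - k2) ** 2)
    upper = d + math.sqrt((u1 - k1) ** 2 + (k2 - l2) ** 2)
    return d, lower, upper

# Hypothetical kappas with 95% limits from two independent groups
print(difference_ci(0.62, 0.48, 0.73, 0.45, 0.30, 0.58))
```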
Affiliation(s)
- Allan Donner
- Department of Epidemiology and Biostatistics, University of Western Ontario, London, Canada.
59
Abstract
A latent-class model of rater agreement is presented for which 1 of the model parameters can be interpreted as the proportion of systematic agreement. The latent classes of the model emerge from the factorial combination of the "true" category to which a target belongs and the ease with which raters are able to classify targets into that category. Several constrained cases of the model are described, and the relations to other well-known agreement models and kappa-type summary coefficients are explained. The differential quality of the rating categories can be assessed on the basis of the model fit. The model is illustrated using data from diagnoses of psychiatric disorders and classifications of individuals in a persuasive communication study.
Affiliation(s)
- Christof Schuster
- Department of Psychology, University of Notre Dame, Indiana 46556, USA.
60
61
Berry KL, Mielke PW. Nonasymptotic significance tests for two measures of agreement. Percept Mot Skills 2001; 93:109-14. [PMID: 11693671 DOI: 10.2466/pms.2001.93.1.109]
Abstract
The kappa agreement coefficients of Cohen (1960) and of Brennan and Prediger (1981) are defined and compared. A FORTRAN program is described that computes Cohen's kappa and Brennan and Prediger's kappa, together with associated probability values based on Monte Carlo resampling and the binomial distribution, respectively.
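The two coefficients being compared can be sketched as follows (in Python rather than the paper's FORTRAN); the Monte Carlo and binomial significance tests are not reproduced, and the 2x2 table is invented.

```python
# Cohen's kappa takes chance agreement from the observed margins; Brennan and
# Prediger's kappa fixes it at 1/k. Point estimates only; the paper's
# significance tests are not reproduced here.
import numpy as np

def cohen_kappa(table):
    t = np.asarray(table, dtype=float); n = t.sum()
    po = np.trace(t) / n
    pe = np.sum(t.sum(axis=0) * t.sum(axis=1)) / n ** 2   # margin-based chance agreement
    return (po - pe) / (1 - pe)

def brennan_prediger_kappa(table):
    t = np.asarray(table, dtype=float)
    k = t.shape[0]
    po = np.trace(t) / t.sum()
    return (po - 1 / k) / (1 - 1 / k)                      # chance agreement fixed at 1/k

table = [[20, 5], [10, 15]]   # hypothetical 2x2 cross-classification of two raters
print(cohen_kappa(table), brennan_prediger_kappa(table))
```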
Affiliation(s)
- K L Berry
- Colorado State University, Fort Collins 80523-1784, USA.
62
63
Abstract
Cohen's kappa statistic is a very well known measure of agreement between two raters with respect to a dichotomous outcome. Several expressions for its asymptotic variance have been derived, and the normal approximation to its distribution has been used to construct confidence intervals. However, information on the accuracy of these normal-approximation confidence intervals is not comprehensive. Under the common correlation model for dichotomous data, we evaluate 95 per cent lower confidence bounds constructed using four asymptotic variance expressions. Exact computation, rather than simulation, is employed. Specific conditions under which the use of asymptotic variance formulae is reasonable are determined.
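For orientation, the sketch below computes a lower confidence bound for kappa on a 2x2 table using one simple, widely quoted large-sample variance approximation; whether it corresponds to any of the four variance expressions evaluated in the paper is an assumption.

```python
# Illustrative lower confidence bound for Cohen's kappa on a 2x2 table, using
# the crude large-sample approximation var(kappa) ~ Po(1 - Po) / (n (1 - Pe)^2).
# This is only one commonly quoted approximation, not necessarily one of the
# four expressions studied in the paper.
import numpy as np
from scipy.stats import norm

def kappa_lower_bound(table, conf=0.95):
    t = np.asarray(table, dtype=float); n = t.sum()
    po = np.trace(t) / n
    pe = np.sum(t.sum(axis=0) * t.sum(axis=1)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    se = np.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))   # crude large-sample standard error
    return kappa, kappa - norm.ppf(conf) * se           # one-sided lower bound

print(kappa_lower_bound([[35, 5], [8, 52]]))
```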
Affiliation(s)
- N J Blackman
- SmithKline Beecham, 1250 South Collegeville Road, P.O. Box 5089, Collegeville, PA, 19426-0989, USA.
64
Abstract
Procedures are developed and compared for testing the equality of two dependent kappa statistics in the case of two raters and a dichotomous outcome variable. Such problems may arise when each subject in a sample is rated under two distinct settings, and it is of interest to compare the observed levels of inter-observer and intra-observer agreement. The procedures compared are extensions of previously developed procedures for comparing kappa statistics computed from independent samples. The results of a Monte Carlo simulation show that adjusting for the dependency between samples tends to be worthwhile only if the between-setting correlation is comparable in magnitude to the within-setting correlations. In this case, a goodness-of-fit procedure that takes into account the dependency between samples is recommended.
Affiliation(s)
- A Donner
- Department of Epidemiology and Biostatistics, The University of Western Ontario, London, Ontario, N6A 5C1, Canada
65
Acklin MW, McDowell CJ, Verschell MS, Chan D. Interobserver agreement, intraobserver reliability, and the Rorschach Comprehensive System. J Pers Assess 2000; 74:15-47. [PMID: 10779931 DOI: 10.1207/s15327752jpa740103]
Abstract
Interrater agreement and reliability for the Rorschach have recently come under increasing scrutiny. This is the second report examining methods of Comprehensive System reliability using principles derived from observational methodology and applied behavioral analysis. This study examined a previous nonpatient sample of 20 protocols (N = 412 responses) and a new clinical sample of 20 protocols (N = 374 responses) from patients diagnosed according to Research Diagnostic Criteria. Reliability was analyzed at multiple levels of Comprehensive System data, including response-level individual codes and coding decisions and ratios, percentages, and derivations from the Structural Summary. With a number of exceptions, most Comprehensive System codes, coding decisions, and summary scores yield acceptable, and in many instances excellent, levels of reliability. Limitations arising from the nature of Rorschach data and Comprehensive System coding criteria are discussed.
66
Barr SG, Zonana-Nacach A, Magder LS, Petri M. Patterns of disease activity in systemic lupus erythematosus. Arthritis Rheum 1999; 42:2682-8. [PMID: 10616018 DOI: 10.1002/1529-0131(199912)42:12<2682::aid-anr26>3.0.co;2-6]
Abstract
OBJECTIVE: To describe patterns of systemic lupus erythematosus (SLE) disease activity over time. METHODS: Disease activity was measured in a prospective cohort of 204 consecutive SLE patients followed up quarterly for 2.0-7.5 years (911 person-years of followup). Physician's global assessment (PGA) and modified SLE Disease Activity Index (M-SLEDAI; omitting serology) scores were plotted against time for each patient. Definitions for disease activity patterns were developed by consensus using these plots, and the proportion of total follow-up time represented by each pattern was calculated. RESULTS: Three patterns of SLE activity were apparent: relapsing-remitting (RR), chronic active (CA), and long quiescent (LQ). The CA pattern was the most frequent for both the PGA and M-SLEDAI, representing 58% and 40% of total person-years, respectively. The least common pattern was LQ (PGA 16%, M-SLEDAI 25%), while the RR pattern was intermediate in frequency (PGA 26%, M-SLEDAI 35%). Average disease activity during RR periods tended to be mild, while that during CA periods was more likely to be moderately severe. The most common discrepancy between instruments was that the PGA depicted CA when the M-SLEDAI showed an RR pattern. The M-SLEDAI did not appear to capture mild degrees of activity. CONCLUSION: SLE activity was readily classified into 1 of 3 patterns: RR, CA, or LQ. The CA pattern was most common, suggesting that significant morbidity may arise from persistent disease activity. These findings may have important implications regarding the choice of outcome measures in SLE clinical trials, since comparison of flare rates may not account for chronic disease activity.
Affiliation(s)
- S G Barr
- Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
67
Goodman LA, Thompson KM, Weinfurt K, Corl S, Acker P, Mueser KT, Rosenberg SD. Reliability of reports of violent victimization and posttraumatic stress disorder among men and women with serious mental illness. J Trauma Stress 1999; 12:587-99. [PMID: 10646178 DOI: 10.1023/a:1024708916143]
Abstract
Although violent victimization is highly prevalent among men and women with serious mental illness (SMI; e.g., schizophrenia, bipolar disorder), future research in this area may be impeded by controversy concerning the ability of individuals with SMI to report traumatic events reliably. This article presents the results of a study exploring the temporal consistency of reports of childhood sexual abuse, adult sexual abuse, and adult physical abuse, as well as current symptoms of posttraumatic stress disorder (PTSD) among 50 people with SMI. Results show that trauma history and PTSD assessments can, for the most part, yield reliable information essential to further research in this area. The study also demonstrates the importance of using a variety of statistical methods to assess the reliability of self-reports of trauma history.
Affiliation(s)
- L A Goodman
- Counseling Psychology Program, School of Education, Boston College, MA 02467, USA.
68
Banerjee M, Capozzoli M, McSweeney L, Sinha D. Beyond kappa: A review of interrater agreement measures. Can J Stat 1999. [DOI: 10.2307/3315487]
69
70
Falk R, Well AD. Correlation as Probability of Common Descent. Multivariate Behav Res 1996; 31:219-238. [PMID: 26801457 DOI: 10.1207/s15327906mbr3102_4]
Abstract
We highlight one interpretation of Pearson's r (largely unknown to behavioral scientists), inspired by the genetic measurement of inbreeding. The coefficient of inbreeding, defined as the probability that two paired alleles originate from common descent, equals the correlation between the uniting gametes. We specify the statistical conditions under which r can be interpreted as probability of identity by descent and explore the possibility of generalizing that meaning of correlation beyond the inbreeding context. Extensions to the framework of agreement between judges and to that of sequential dependencies are considered. Viewing correlation as probability is heuristically promising. We examine the implications of this approach in the case of three types of bivariate distributions and discuss potential insights and risks.
71
Goldstein MF, Friedman SR, Neaigus A, Jose B, Ildefonso G, Curtis R. Self-reports of HIV risk behavior by injecting drug users: are they reliable? Addiction 1995; 90:1097-104. [PMID: 7549778 DOI: 10.1046/j.1360-0443.1995.90810978.x]
Abstract
While most studies of AIDS risk behavior rely on self-reports, few studies have assessed the reliability of these reports. The present study examines self-reports of drug-related and sexual risk behavior among pairs of injecting drug users (IDUs) recruited from the streets in New York City. Since both members of the pair were interviewed, it was possible to compare their responses in order to assess reliability. Subjects reported on their contacts' demographic data (age, gender, race/ethnicity) and on shared risk behaviors, including syringe sharing. Despite the private and/or illegal nature of AIDS risk behaviors, IDU subjects were generally reliable in their reports of both demographic and AIDS risk behaviors.
Affiliation(s)
- M F Goldstein
- National Development and Research Institutes, Inc., New York, New York 10013, USA
72
Agresti A, Ghosh A, Bini M. Raking Kappa: Describing Potential Impact of Marginal Distributions on Measures of Agreement. Biom J 1995. [DOI: 10.1002/bimj.4710370705]
73
Cordes AK. The reliability of observational data: I. Theories and methods for speech-language pathology. J Speech Hear Res 1994; 37:264-278. [PMID: 8028308 DOI: 10.1044/jshr.3702.264]
Abstract
Much research and clinical work in speech-language pathology depends on the validity and reliability of data gathered through the direct observation of human behavior. This paper reviews several definitions of reliability, concluding that behavior observation data are reliable if they, and the experimental conclusions drawn from them, are not affected by differences among observers or by other variations in the recording context. The theoretical bases of several methods commonly used to estimate reliability for observational data are reviewed, with examples of the use of these methods drawn from a recent volume of the Journal of Speech and Hearing Research (35, 1992). Although most recent research publications in speech-language pathology have addressed the issue of reliability for their observational data to some extent, most reliability estimates do not clearly establish that the data or the experimental conclusions were replicable or unaffected by differences among observers. Suggestions are provided for improving the usefulness of the reliability estimates published in speech-language pathology research.
Affiliation(s)
- A K Cordes
- Department of Speech and Hearing Sciences, University of California, Santa Barbara 93106-7050
74
Olsen LH, Overgaard S, Frederiksen P, Ladefoged C, Ludwigsen E, Petri J, Poulsen JT. The reliability of staging and grading of bladder tumours. Impact of misinformation on the pathologist's diagnosis. Scand J Urol Nephrol 1993; 27:349-53. [PMID: 8290915 DOI: 10.3109/00365599309180446]
Abstract
The influence of misinformation on the reliability of the histopathological classification of bladder tumours was analysed. Four consultant pathologists assessed 40 biopsy specimens of bladder tumours, staging invasion and grading the specimens according to the Bergkvist classification. A random sample of 20 specimens was accompanied by systematically distorted information ("bias", unknown to the pathologists) about the patient's previous histological grading (bias group); the other 20 specimens served as a control group (non-bias group). After 6 months a second round with the same specimens was arranged to assess the influence of bias on the intraobserver variation. Using kappa statistics, the chance-corrected interobserver agreement rate was poor both in staging of invasion and in grading according to the Bergkvist classification (kappa < 0.50). The kappa values in the intraobserver study ranged from poor to excellent, with a tendency towards lower kappa when the observer had been biased. The kappa values in the assessment of malignancy were acceptable to excellent. False information did not affect the pathologists' diagnoses significantly.
Affiliation(s)
- L H Olsen
- Department of Surgery, Sønderborg Hospital, Denmark
75
Abstract
Since the introduction of Cohen's kappa as a chance-adjusted measure of agreement between two observers, several "paradoxes" in its interpretation have been pointed out. The difficulties occur because kappa not only measures agreement but is also affected in complex ways by the presence of bias between observers and by the distributions of data across the categories that are used ("prevalence"). In this paper, new indices that provide independent measures of bias and prevalence, as well as of observed agreement, are defined and a simple formula is derived that expresses kappa in terms of these three indices. When comparisons are made between agreement studies it can be misleading to report kappa values alone, and it is recommended that researchers also include quantitative indicators of bias and prevalence.
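For a 2x2 table, the quantities described above and the identity relating them are commonly written as in the sketch below (a and d are the agreement cells, b and c the disagreement cells; PABAK denotes the prevalence- and bias-adjusted kappa, 2Po - 1).

```python
# The three indices from the abstract for a 2x2 table, plus the identity that
# expresses Cohen's kappa in terms of them. Cell layout assumed: a = both
# positive, d = both negative, b and c = the two kinds of disagreement.
def agreement_indices(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n                  # observed agreement
    bi = (b - c) / n                  # bias index: difference between raters' marginal rates
    pi = (a - d) / n                  # prevalence index
    pabak = 2 * po - 1                # prevalence- and bias-adjusted kappa
    kappa = (pabak - pi ** 2 + bi ** 2) / (1 - pi ** 2 + bi ** 2)
    return po, bi, pi, pabak, kappa

print(agreement_indices(a=40, b=10, c=20, d=30))
```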
Affiliation(s)
- T Byrt
- Clinical Epidemiology and Biostatistics Unit, Royal Children's Hospital Research Foundation, Parkville, Victoria, Australia
76
Hassebrock F, Prietula MJ. A protocol-based coding scheme for the analysis of medical reasoning. Int J Man Mach Stud 1992. [DOI: 10.1016/0020-7373(92)90026-h]
77
Abstract
This article presents a survey of ways of statistically modelling patterns of observer agreement and disagreement. Main emphasis is placed on modelling inter-observer agreement for categorical responses, both for nominal and ordinal response scales. Models discussed include (1) simple cell-probability models based on Cohen's kappa that focus on beyond-chance agreement, (2) loglinear models for square tables, such as quasi-independence and quasi-symmetry models, (3) latent class models that express the joint distribution between ratings as a mixture of clusters of homogeneous subjects, each cluster having the same 'true' rating, and (4) Rasch models, which decompose subject-by-observer rating distributions using observer and subject main effects. Models can address two distinct components of agreement: strength of association between ratings, and similarity of marginal distributions of the ratings.
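As one concrete example of the loglinear models mentioned in (2), a quasi-independence model for a square agreement table can be fitted as a Poisson GLM with row and column main effects plus a separate parameter for each diagonal cell; the 3x3 counts below are invented for illustration.

```python
# Quasi-independence loglinear model for a square rater-by-rater table,
# fitted as a Poisson GLM: row and column main effects plus one parameter per
# main-diagonal (exact-agreement) cell. Invented 3x3 data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

table = np.array([[22, 5, 2],
                  [4, 18, 6],
                  [1, 7, 25]])
rows, cols = np.indices(table.shape)
df = pd.DataFrame({
    "count": table.ravel(),
    "row": rows.ravel().astype(str),
    "col": cols.ravel().astype(str),
})
# 'diag' singles out each main-diagonal cell; off-diagonal cells share one level
df["diag"] = np.where(df.row == df.col, "d" + df.row, "off")

fit = smf.glm("count ~ C(row) + C(col) + C(diag)", data=df,
              family=sm.families.Poisson()).fit()
print(fit.summary())
```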
Affiliation(s)
- A Agresti
- Department of Statistics, University of Florida, Gainesville 32611
78
Weiss MG, Doongaji DR, Siddhartha S, Wypij D, Pathare S, Bhatawdekar M, Bhave A, Sheth A, Fernandes R. The Explanatory Model Interview Catalogue (EMIC). Contribution to cross-cultural research methods from a study of leprosy and mental health. Br J Psychiatry 1992; 160:819-30. [PMID: 1617366 DOI: 10.1192/bjp.160.6.819]
Abstract
The Explanatory Model Interview Catalogue (EMIC) has been developed to elicit illness-related perceptions, beliefs, and practices in a cultural study of leprosy and mental health in Bombay. Leprosy is an especially appropriate disorder for studying the inter-relationship of culture, mental health and medical illness because of deeply rooted cultural meanings, the emotional burden, and underuse of effective therapy. Fifty per cent of 56 recently diagnosed leprosy out-patients, 37% of 19 controls with another stigmatised dermatological condition (vitiligo), but only 8% of 12 controls with a comparable non-stigmatised condition (tinea versicolor) met DSM-III-R criteria for an axis I depressive, anxiety or somatoform disorder. Belief in a humoral (traditional) cause of illness predicted better attendance at clinic.
Affiliation(s)
- M G Weiss
- Department of Psychiatry, Harvard Medical School, Boston, Massachusetts
79
Abstract
The utility of the Hand Test as a quick, reliable measure of children's personality was assessed with 100 children. The interscorer reliability of the Hand Test was estimated by both intraclass correlations and the kappa coefficient. Following training, satisfactory intraclass correlations were obtained for the Quantitative scores (20 of 22 above .70) and the Qualitative scores (12 of 27 above .70); kappa coefficients were generally lower. Scorers' memory overload and low response frequency are discussed as possible bases for the low reliabilities of the Qualitative scores. Although the Hand Test reliability for Quantitative scores is consistent with that of other projective tests, consideration should be given to modifying the directions of administration for young children and clarifying the scoring rules.
Affiliation(s)
- D E Carter
- Department of Educational Foundations, State University College, Buffalo, New York 14222
80
81
Castorr AH, Thompson KO, Ryan JW, Phillips CY, Prescott PA, Soeken KL. The process of rater training for observational instruments: implications for interrater reliability. Res Nurs Health 1990; 13:311-8. [PMID: 2236654 DOI: 10.1002/nur.4770130507]
Abstract
Although the process of rater training is important for establishing interrater reliability of observational instruments, there is little information available in the current literature to guide the researcher. In this article, principles and procedures that can be used when rater performance is a critical element of reliability assessment are described. Three phases of the process of rater training are presented: (a) training raters to use the instrument; (b) evaluating rater performance at the end of training; and (c) determining the extent to which rater training is maintained during a reliability study. An example is presented to illustrate how these phases were incorporated in a study examining the reliability of a measure of patient intensity called the Patient Intensity for Nursing Index (PINI).
Affiliation(s)
- A H Castorr
- University of Maryland School of Nursing, Baltimore 21201
82
Abstract
We investigate the properties of a measure of interrater agreement originally proposed by Rogot and Goldberg. Unlike commonly used measures, this measure not only adjusts for chance agreement, but it also standardizes for both perfect agreement as well as for perfect disagreement. Further, one can also use this measure to assess category specific conditional agreement, and thus apply it to situations with missing main diagonal data. We provide an asymptotic method for inference with this measure.
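For context only, the sketch below computes the category-specific conditional agreement rates mentioned in the abstract and their simple average (one of the original Rogot-Goldberg indices); the chance-corrected, disagreement-standardized measure actually studied in the paper, and its asymptotic inference, are not reproduced here.

```python
# Context sketch only: category-specific conditional agreement rates for a
# 2x2 table and their simple average. The chance-corrected measure actually
# studied in the paper is NOT reproduced here.
def conditional_agreement(a, b, c, d):
    """2x2 table: a = both positive, d = both negative, b and c = disagreements."""
    pos_given_r1 = a / (a + b)   # agreement on 'positive' given rater 1 said positive
    pos_given_r2 = a / (a + c)   # ... given rater 2 said positive
    neg_given_r1 = d / (c + d)   # agreement on 'negative' given rater 1 said negative
    neg_given_r2 = d / (b + d)   # ... given rater 2 said negative
    avg = (pos_given_r1 + pos_given_r2 + neg_given_r1 + neg_given_r2) / 4
    return pos_given_r1, pos_given_r2, neg_given_r1, neg_given_r2, avg

print(conditional_agreement(a=40, b=10, c=20, d=30))
```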
Affiliation(s)
- K F Hirji
- Department of Epidemiology, Faculty of Medicine, University of Dar es Salaam, Tanzania
83
Abstract
We describe methods based on latent class analysis for analysis and interpretation of agreement on dichotomous diagnostic ratings. This approach formulates agreement in terms of parameters directly related to diagnostic accuracy and leads to many practical applications, such as estimation of the accuracy of individual ratings and the extent to which accuracy may improve with multiple opinions. We describe refinements in the estimation of parameters for varying panel designs, and apply latent class methods successfully to examples of medical agreement data that include data previously found to be poorly fitted by two-class models. Latent class techniques provide a powerful and flexible set of tools to analyse diagnostic agreement and one should consider them routinely in the analysis of such data.
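A generic two-class latent class model for dichotomous ratings, fitted by EM under conditional independence, conveys the idea; this is a textbook-style sketch, not the authors' estimation procedure, and the panel-design refinements described in the paper are omitted.

```python
# Generic two-class latent class model for dichotomous ratings by R raters,
# fitted by EM under conditional independence. Textbook-style sketch only.
import numpy as np

def two_class_lca(X, n_iter=200, seed=0):
    """X: (subjects, raters) 0/1 matrix. Returns prevalence, per-rater sensitivity/specificity."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    prev = 0.5
    sens = rng.uniform(0.6, 0.9, X.shape[1])   # P(rating = 1 | latent class 1)
    spec = rng.uniform(0.6, 0.9, X.shape[1])   # P(rating = 0 | latent class 0)
    for _ in range(n_iter):
        # E-step: posterior probability that each subject belongs to latent class 1
        l1 = prev * np.prod(sens ** X * (1 - sens) ** (1 - X), axis=1)
        l0 = (1 - prev) * np.prod((1 - spec) ** X * spec ** (1 - X), axis=1)
        w = l1 / (l1 + l0)
        # M-step: update prevalence and rater accuracy parameters
        prev = w.mean()
        sens = (w[:, None] * X).sum(axis=0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - X)).sum(axis=0) / (1 - w).sum()
    return prev, sens, spec

# Toy data: 8 subjects rated by 3 raters
X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
print(two_class_lca(X))
```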
Affiliation(s)
- J S Uebersax
- Behavioral Sciences Department, RAND Corporation, Santa Monica, CA 90406