1
|
Brusco MJ, Steinley D, Watts AL. A comparison of logistic regression methods for Ising model estimation. Behav Res Methods 2023; 55:3566-3584. [PMID: 36266525 DOI: 10.3758/s13428-022-01976-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/05/2022] [Indexed: 11/08/2022]
Abstract
The Ising model has received significant attention in network psychometrics during the past decade. A popular estimation procedure is IsingFit, which uses nodewise l1-regularized logistic regression along with the extended Bayesian information criterion to establish the edge weights for the network. In this paper, we report the results of a simulation study comparing IsingFit to two alternative approaches: (1) a nonregularized nodewise stepwise logistic regression method, and (2) a recently proposed global l1-regularized logistic regression method that estimates all edge weights in a single stage, thus circumventing the need for nodewise estimation. MATLAB scripts for the methods are provided as supplemental material. The global l1-regularized logistic regression method generally provided greater accuracy and sensitivity than IsingFit, at the expense of lower specificity and much greater computation time. The stepwise approach showed considerable promise. Relative to the l1-regularized approaches, the stepwise method provided better average specificity for all experimental conditions, as well as comparable accuracy and sensitivity at the largest sample size.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, FL, USA.
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
| | - Ashley L Watts
- Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
| |
Collapse
|
2
|
Brusco MJ, Steinley D, Watts AL. On maximization of the modularity index in network psychometrics. Behav Res Methods 2023; 55:3549-3565. [PMID: 36258108 DOI: 10.3758/s13428-022-01975-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/04/2022] [Indexed: 11/08/2022]
Abstract
The modularity index (Q) is an important criterion for many community detection heuristics used in network psychometrics and its subareas (e.g., exploratory graph analysis). Some heuristics seek to directly maximize Q, whereas others, such as the walktrap algorithm, only use the modularity index post hoc to determine the number of communities. Researchers in network psychometrics have typically not employed methods that are guaranteed to find a partition that maximizes Q, perhaps because of the complexity of the underlying mathematical programming problem. In this paper, for networks of the size commonly encountered in network psychometrics, we explore the utility of finding the partition that maximizes Q via formulation and solution of a clique partitioning problem (CPP). A key benefit of the CPP is that the number of communities is naturally determined by its solution and, therefore, need not be prespecified in advance. The results of two simulation studies comparing maximization of Q to two other methods that seek to maximize modularity (fast greedy and Louvain), as well as one popular method that does not (walktrap algorithm), provide interesting insights as to the relative performances of the methods with respect to identification of the correct number of communities and the recovery of underlying community structure.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, FL, USA.
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
| | - Ashley L Watts
- Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
| |
Collapse
|
3
|
Brusco MJ, Steinley D, Watts AL. Disentangling relationships in symptom networks using matrix permutation methods. Psychometrika 2022; 87:133-155. [PMID: 34282531 DOI: 10.1007/s11336-021-09760-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/09/2020] [Revised: 02/12/2021] [Accepted: 03/18/2021] [Indexed: 06/13/2023]
Abstract
Common outputs of software programs for network estimation include association matrices containing the edge weights between pairs of symptoms and a plot of the symptom network. Although such outputs are useful, it is sometimes difficult to ascertain structural relationships among symptoms from these types of output alone. We propose that matrix permutation provides a simple, yet effective, approach for clarifying the order relationships among the symptoms based on the edge weights of the network. For directed symptom networks, we use a permutation criterion that has classic applications in electrical circuit theory and economics. This criterion can be used to place symptoms that strongly predict other symptoms at the beginning of the ordering, and symptoms that are strongly predicted by other symptoms at the end. For undirected symptom networks, we recommend a permutation criterion that is based on location theory in the field of operations research. When using this criterion, symptoms with many strong ties tend to be placed centrally in the ordering, whereas weakly-tied symptoms are placed at the ends. The permutation optimization problems are solved using dynamic programming. We also make use of branch-search algorithms for extracting maximum cardinality subsets of symptoms that have perfect structure with respect to a selected criterion. Software for implementing the dynamic programming algorithms is available in MATLAB and R. Two networks from the literature are used to demonstrate the matrix permutation algorithms.
Collapse
|
4
|
Steinley D, Brusco MJ. On Fixed Marginal Distributions and Psychometric Network Models. Multivariate Behav Res 2021; 56:329-335. [PMID: 33960861 DOI: 10.1080/00273171.2021.1895706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
This reply addresses the commentary by Epskamp et al. (in press) on our prior work, of using fixed marginals for sampling the data for testing hypothesis in psychometric network application. Mathematical results are presented for expected column (e.g., item prevalence) and row (e.g., subject severity) probabilities under three classical sampling schemes in categorical data analysis: (i) fixing the density, (ii) fixing either the row or column marginal, or (iii) fixing both the row and column marginal. It is argued that, while a unidimensional structure may not be the model we want, it is the structure we are confronted with given the binary nature of the data. Interpreting network models in the context of this artifactual structure is necessary, with preferred solutions to be expanding the item sets of disorders and moving away from the use of binary data and their associated constraints.
Collapse
|
5
|
Loeffelman JE, Steinley D, Boness CL, Trull TJ, Wood PK, Brusco MJ, Sher KJ. Combinatorial Optimization of Clustering Decisions: An Approach to Refine Psychiatric Diagnoses. Multivariate Behav Res 2021; 56:57-69. [PMID: 32054331 PMCID: PMC8428183 DOI: 10.1080/00273171.2020.1717921] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Using complete enumeration (e.g., generating all possible subsets of item combinations) to evaluate clustering problems has the benefit of locating globally optimal solutions automatically without the concern of sampling variability. The proposed method is meant to combine clustering variables in such a way as to create groups that are maximally different on a theoretically sound derivation variable(s). After the population of all unique sets is permuted, optimization on some predefined, user-specific function can occur. We apply this technique to optimizing the diagnosis of Alcohol Use Disorder. This is a unique application, from a clustering point of view, in that the decision rule for clustering observations into the "diagnosis" group relies on both the set of items being considered and a predefined threshold on the number of items required to be endorsed for the "diagnosis" to occur. In optimizing diagnostic rules, criteria set sizes can be reduced without a loss of significant information when compared to current and proposed, alternative, diagnostic schemes.
Collapse
|
6
|
Brusco MJ, Cradit JD, Steinley D. Combining diversity and dispersion criteria for anticlustering: A bicriterion approach. Br J Math Stat Psychol 2020; 73:375-396. [PMID: 31512759 DOI: 10.1111/bmsp.12186] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/26/2018] [Revised: 06/22/2019] [Indexed: 06/10/2023]
Abstract
Most partitioning methods used in psychological research seek to produce homogeneous groups (i.e., groups with low intra-group dissimilarity). However, there are also applications where the goal is to provide heterogeneous groups (i.e., groups with high intra-group dissimilarity). Examples of these anticlustering contexts include construction of stimulus sets, formation of student groups, assignment of employees to project work teams, and assembly of test forms from a bank of items. Unfortunately, most commercial software packages are not equipped to accommodate the objective criteria and constraints that commonly arise for anticlustering problems. Two important objective criteria for anticlustering based on information in a dissimilarity matrix are: a diversity measure based on within-cluster sums of dissimilarities; and a dispersion measure based on the within-cluster minimum dissimilarities. In many instances, it is possible to find a partition that provides a large improvement in one of these two criteria with little (or no) sacrifice in the other criterion. For this reason, it is of significant value to explore the trade-offs that arise between these two criteria. Accordingly, the key contribution of this paper is the formulation of a bicriterion optimization problem for anticlustering based on the diversity and dispersion criteria, along with heuristics to approximate the Pareto efficient set of partitions. A motivating example and computational study are provided within the framework of test assembly.
Collapse
|
7
|
Brusco MJ, Steinley D, Hoffman M, Davis-Stober C, Wasserman S. On Ising models and algorithms for the construction of symptom networks in psychopathological research. Psychol Methods 2019; 24:735-753. [DOI: 10.1037/met0000207] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
8
|
Brusco MJ, Cradit JD, Brudvig S. An integrated dominance analysis and dynamic programing approach for measuring predictor importance for customer satisfaction. COMMUN STAT-THEOR M 2019. [DOI: 10.1080/03610926.2018.1510004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Affiliation(s)
- Michael J. Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, Florida, USA
| | - J. Dennis Cradit
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, Florida, USA
| | - Susan Brudvig
- School of Business and Economics, Indiana University East, Richmond, Indiana, USA
| |
Collapse
|
9
|
Affiliation(s)
- Michael J. Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, Florida, USA
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, Missouri, USA
| | - Jordan Stevens
- Department of Psychological Sciences, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
10
|
Brusco MJ, Steinley D, Köhn HF. Residual analysis for unidimensional scaling in the L2-norm. COMMUN STAT-SIMUL C 2019. [DOI: 10.1080/03610918.2018.1438620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
Affiliation(s)
- Michael J. Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, FL, USA
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, MO, USA
| | | |
Collapse
|
11
|
Brusco MJ, Voorhees CM, Calantone RJ, Brady MK, Steinley D. Integrating linear discriminant analysis, polynomial basis expansion, and genetic search for two-group classification. COMMUN STAT-SIMUL C 2019. [DOI: 10.1080/03610918.2017.1419262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
Affiliation(s)
- Michael J. Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, FL USA
| | - Clay M. Voorhees
- Department of Marketing, Michigan State University, East Lansing, Michigan USA
| | - Roger J. Calantone
- Department of Marketing, Michigan State University, East Lansing, Michigan USA
| | - Michael K. Brady
- Department of Marketing, Florida State University, Tallahassee, Florida, USA
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri-Columbia, Columbia, Missouri USA
| |
Collapse
|
12
|
Steinley D, Hoffman M, Brusco MJ, Sher KJ. A method for making inferences in network analysis: Comment on Forbes, Wright, Markon, and Krueger (2017). J Abnorm Psychol 2019; 126:1000-1010. [PMID: 29106283 DOI: 10.1037/abn0000308] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Forbes, Wright, Markon, and Krueger (2017) make a compelling case for proceeding cautiously with respect to the overinterpretation and dissemination of results using the increasingly popular approach of creating "networks" from co-occurrences of psychopathology symptoms. We commend the authors on their initial investigation and their utilization of cross-validation techniques in an effort to capture the stability of a variety of network estimation methods. Such techniques get at the heart of establishing "reproducibility," an increasing focus of concern in both psychology (e.g., Pashler & Wagenmakers, 2012) and science more generally (e.g., Baker, 2016). However, as we will show, the problem is likely worse (or at least more complicated) than they initially indicated. Specifically, for multivariate binary data, the marginal distributions enforce a large degree of structure on the data. We show that some expected measurements-such as commonly used centrality statistics-can have substantially higher values than what would usually be expected. As such, we propose a nonparametric approach to generate confidence intervals through Monte Carlo simulation. We apply the proposed methodology to the National Comorbidity Survey - Replication, provided by Forbes et al., finding that the many of the results are indistinguishable from what would be expected by chance. Further, we discuss the problem of multiple testing and potential issues of applying methods developed for 1-mode networks (e.g., ties within a single set of observations) to 2-mode networks (e.g., ties between 2 distinct sets of entities). When taken together, these issues indicate that the psychometric network models should be employed with extreme caution and interpreted guardedly. (PsycINFO Database Record
Collapse
Affiliation(s)
| | | | - Michael J Brusco
- Department of Business Analytics, Information Systems & Supply Chain, Florida State University
| | - Kenneth J Sher
- Department of Psychological Sciences, University of Missouri
| |
Collapse
|
13
|
Brusco MJ, Steinley D, Stevens J, Cradit JD. Affinity propagation: An exemplar-based tool for clustering in psychological research. Br J Math Stat Psychol 2019; 72:155-182. [PMID: 29633235 DOI: 10.1111/bmsp.12136] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/25/2017] [Revised: 01/26/2018] [Indexed: 06/08/2023]
Abstract
Affinity propagation is a message-passing-based clustering procedure that has received widespread attention in domains such as biological science, physics, and computer science. However, its implementation in psychology and related areas of social science is comparatively scant. In this paper, we describe the basic principles of affinity propagation, its relationship to other clustering problems, and the types of data for which it can be used for cluster analysis. More importantly, we identify the strengths and weaknesses of affinity propagation as a clustering tool in general and highlight potential opportunities for its use in psychological research. Numerical examples are provided to illustrate the method.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, Florida, USA
| | - Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, Missouri, USA
| | - Jordan Stevens
- Department of Psychological Sciences, University of Missouri, Columbia, Missouri, USA
| | - J Dennis Cradit
- Department of Business Analytics, Information Systems, and Supply Chain, Florida State University, Tallahassee, Florida, USA
| |
Collapse
|
14
|
Stolze HJ, Mollenkopf DA, Thornton L, Brusco MJ, Flint DJ. Supply Chain and Marketing Integration: Tension in Frontline Social Networks. J Supply Chain Manag 2018. [DOI: 10.1111/jscm.12169] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
15
|
Abstract
Two expectations of the adjusted Rand index (ARI) are compared. It is shown that the expectation derived by Morey and Agresti (1984, Educational and Psychological Measurement, 44, 33) under the multinomial distribution to approximate the exact expectation from the hypergeometric distribution (Hubert & Arabie, 1985, Journal of Classification, 2, 193) provides a poor approximation, and, in some cases, the difference between the two expectations can increase with the sample size. Proofs concerning the minimum and maximum difference between the two expectations are provided, and it is shown through simulation that the ARI can differ significantly depending on which expectation is used. Furthermore, when compared in a hypothesis testing framework, multinomial approximation overly favours the null hypothesis.
Collapse
|
16
|
Abstract
Cohen's κ, a similarity measure for categorical data, has since been applied to problems in the data mining field such as cluster analysis and network link prediction. In this paper, a new application is examined: community detection in networks. A new algorithm is proposed that uses Cohen's κ as a similarity measure for each pair of nodes; subsequently, the κ values are then clustered to detect the communities. This paper defines and tests this method on a variety of simulated and real networks. The results are compared with those from eight other community detection algorithms. Results show this new algorithm is consistently among the top performers in classifying data points both on simulated and real networks. Additionally, this is one of the broadest comparative simulations for comparing community detection algorithms to date.
Collapse
|
17
|
Abstract
The problem of partitioning a collection of objects based on their measurements on a set of dichotomous variables is a well-established problem in psychological research, with applications including clinical diagnosis, educational testing, cognitive categorization, and choice analysis. Latent class analysis and K-means clustering are popular methods for partitioning objects based on dichotomous measures in the psychological literature. The K-median clustering method has recently been touted as a potentially useful tool for psychological data and might be preferable to its close neighbor, K-means, when the variable measures are dichotomous. We conducted simulation-based comparisons of the latent class, K-means, and K-median approaches for partitioning dichotomous data. Although all 3 methods proved capable of recovering cluster structure, K-median clustering yielded the best average performance, followed closely by latent class analysis. We also report results for the 3 methods within the context of an application to transitive reasoning data, in which it was found that the 3 approaches can exhibit profound differences when applied to real data. (PsycINFO Database Record
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Analytics, Information Systems, & Supply Chain, Florida State University
| | - Emilie Shireman
- Department of Psychological Sciences, University of Missouri
| | | |
Collapse
|
18
|
Abstract
Purpose
The purpose of this paper is twofold. First, the authors provide a survey of operations management (OM) research applications of traditional hierarchical and nonhierarchical clustering methods with respect to key decisions that are central to a valid analysis. Second, the authors offer recommendations for practice with respect to these decisions.
Design/methodology/approach
A coding study was conducted for 97 cluster analyses reported in six OM journals during the period spanning 1994-2015. Data were collected with respect to: variable selection, variable standardization, method, selection of the number of clusters, consistency/stability of the clustering solution, and profiling of the clusters based on exogenous variables. Recommended practices for validation of clustering solutions are provided within the context of this framework.
Findings
There is considerable variability across clustering applications with respect to the components of validation, as well as a mix of productive and undesirable practices. This justifies the importance of the authors’ provision of a schema for conducting a cluster analysis.
Research limitations/implications
Certain aspects of the coding study required some degree of subjectivity with respect to interpretation or classification. However, in light of the sheer magnitude of the coding study (97 articles), the authors are confident that an accurate picture of empirical OM clustering applications has been presented.
Practical implications
The paper provides a critique and synthesis of the practice of cluster analysis in OM research. The coding study provides a thorough foundation for how the key decisions of a cluster analysis have been previously handled in the literature. Both researchers and practitioners are provided with guidelines for performing a valid cluster analysis.
Originality/value
To the best of the authors’ knowledge, no study of this type has been reported in the OM literature. The authors’ recommendations for cluster validation draw from recent studies in other disciplines that are apt to be unfamiliar to many OM researchers.
Collapse
|
19
|
Brusco MJ, Shireman E, Steinley D, Brudvig S, Cradit JD. Gaussian model-based partitioning using iterated local search. Br J Math Stat Psychol 2017; 70:1-24. [PMID: 28130935 DOI: 10.1111/bmsp.12084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/02/2016] [Revised: 08/12/2016] [Indexed: 06/06/2023]
Abstract
The emergence of Gaussian model-based partitioning as a viable alternative to K-means clustering fosters a need for discrete optimization methods that can be efficiently implemented using model-based criteria. A variety of alternative partitioning criteria have been proposed for more general data conditions that permit elliptical clusters, different spatial orientations for the clusters, and unequal cluster sizes. Unfortunately, many of these partitioning criteria are computationally demanding, which makes the multiple-restart (multistart) approach commonly used for K-means partitioning less effective as a heuristic solution strategy. As an alternative, we propose an approach based on iterated local search (ILS), which has proved effective in previous combinatorial data analysis contexts. We compared multistart, ILS and hybrid multistart-ILS procedures for minimizing a very general model-based criterion that assumes no restrictions on cluster size or within-group covariance structure. This comparison, which used 23 data sets from the classification literature, revealed that the ILS and hybrid heuristics generally provided better criterion function values than the multistart approach when all three methods were constrained to the same 10-min time limit. In many instances, these differences in criterion function values reflected profound differences in the partitions obtained.
Collapse
|
20
|
Abstract
It is common knowledge that mixture models are prone to arrive at locally optimal solutions. Typically, researchers are directed to utilize several random initializations to ensure that the resulting solution is adequate. However, it is unknown what factors contribute to a large number of local optima and whether these coincide with the factors that reduce the accuracy of a mixture model. A real-data illustration and a series of simulations are presented that examine the effect of a variety of data structures on the propensity of local optima and the classification quality of the resulting solution. We show that there is a moderately strong relationship between a solution that has a high proportion of local optima and one that is poorly classified.
Collapse
|
21
|
Brusco MJ, Köhn HF, Steinley D. An evaluation of exact methods for the multiple subset maximum cardinality selection problem. Br J Math Stat Psychol 2016; 69:194-213. [PMID: 27027582 DOI: 10.1111/bmsp.12068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/03/2015] [Revised: 02/20/2016] [Indexed: 06/05/2023]
Abstract
The maximum cardinality subset selection problem requires finding the largest possible subset from a set of objects, such that one or more conditions are satisfied. An important extension of this problem is to extract multiple subsets, where the addition of one more object to a larger subset would always be preferred to increases in the size of one or more smaller subsets. We refer to this as the multiple subset maximum cardinality selection problem (MSMCSP). A recently published branch-and-bound algorithm solves the MSMCSP as a partitioning problem. Unfortunately, the computational requirement associated with the algorithm is often enormous, thus rendering the method infeasible from a practical standpoint. In this paper, we present an alternative approach that successively solves a series of binary integer linear programs to obtain a globally optimal solution to the MSMCSP. Computational comparisons of the methods using published similarity data for 45 food items reveal that the proposed sequential method is computationally far more efficient than the branch-and-bound approach.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee, Florida, USA
| | | | | |
Collapse
|
22
|
Abstract
For 30 years, the adjusted Rand index has been the preferred method for comparing 2 partitions (e.g., clusterings) of a set of observations. Although the index is widely used, little is known about its variability. Herein, the variance of the adjusted Rand index (Hubert & Arabie, 1985) is provided and its properties are explored. It is shown that a normal approximation is appropriate across a wide range of sample sizes and varying numbers of clusters. Further, it is shown that confidence intervals based on the normal distribution have desirable levels of coverage and accuracy. Finally, the first power analysis evaluating the ability to detect differences between 2, different adjusted Rand indices is provided. (PsycINFO Database Record
Collapse
Affiliation(s)
| | | | - Lawrence Hubert
- Department of Psychology, University of Illinois, Champaign-Urbana
| |
Collapse
|
23
|
Brusco MJ, Köhn HF, Steinley D. An Exact Method for Partitioning Dichotomous Items Within the Framework of the Monotone Homogeneity Model. Psychometrika 2015; 80:949-67. [PMID: 25850618 DOI: 10.1007/s11336-015-9459-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/30/2013] [Indexed: 05/28/2023]
Abstract
The monotone homogeneity model (MHM-also known as the unidimensional monotone latent variable model) is a nonparametric IRT formulation that provides the underpinning for partitioning a collection of dichotomous items to form scales. Ellis (Psychometrika 79:303-316, 2014, doi: 10.1007/s11336-013-9341-5 ) has recently derived inequalities that are implied by the MHM, yet require only the bivariate (inter-item) correlations. In this paper, we incorporate these inequalities within a mathematical programming formulation for partitioning a set of dichotomous scale items. The objective criterion of the partitioning model is to produce clusters of maximum cardinality. The formulation is a binary integer linear program that can be solved exactly using commercial mathematical programming software. However, we have also developed a standalone branch-and-bound algorithm that produces globally optimal solutions. Simulation results and a numerical example are provided to demonstrate the proposed method.
Collapse
Affiliation(s)
- Michael J Brusco
- Florida State University, 821 Academic Way, Tallahassee, FL, 32306-1110 , USA.
| | | | | |
Collapse
|
24
|
Abstract
As network data gains popularity for research in various fields, the need for methods to predict future links or find missing links in the data increases. One subset of the methodology used to solve this problem involves creating a similarity measure between each pair of nodes in the network; unfortunately, these methods can be shown to have arbitrary cutoffs and poor performance. To address these shortcomings, we use the adjusted Rand index to create a similarity measure between nodes that has a natural threshold of zero. The effectiveness of this method is then compared to a number of other similarity measures and assessed on a variety of simulated data sets with block model structure and three real network data sets. Under this particular formulation of the adjusted Rand index, information is also provided on dissimilarity. As such, we then go on to test its use for detecting incorrect links in network data, highlighting the dual use of the approach.
Collapse
|
25
|
Köhn HF, Chiu CY, Brusco MJ. Heuristic cognitive diagnosis when the Q-matrix is unknown. Br J Math Stat Psychol 2015; 68:268-291. [PMID: 25496248 DOI: 10.1111/bmsp.12044] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2012] [Revised: 07/12/2014] [Indexed: 06/04/2023]
Abstract
Cognitive diagnosis models of educational test performance rely on a binary Q-matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non-parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non-parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non-parametric classification techniques require that the Q-matrix underlying a given test be known. Unfortunately, in practice, the Q-matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q-matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum-score profiles does not require knowledge of the Q-matrix, and results in a more accurate classification of examinees.
Collapse
Affiliation(s)
- Hans-Friedrich Köhn
- Department of Psychology, University of Illinois at Urbana-Champaign, Illinois, USA
| | | | | |
Collapse
|
26
|
Abstract
The minimum-diameter partitioning problem (MDPP) seeks to produce compact clusters, as measured by an overall goodness-of-fit measure known as the partition diameter, which represents the maximum dissimilarity between any two objects placed in the same cluster. Complete-linkage hierarchical clustering is perhaps the best-known heuristic method for the MDPP and has an extensive history of applications in psychological research. Unfortunately, this method has several inherent shortcomings that impede the model selection process, such as: (1) sensitivity to the input order of the objects, (2) failure to obtain a globally optimal minimum-diameter partition when cutting the tree at K clusters, and (3) the propensity for a large number of alternative minimum-diameter partitions for a given K. We propose that each of these problems can be addressed by applying an algorithm that finds all of the minimum-diameter partitions for different values of K. Model selection is then facilitated by considering, for each value of K, the reduction in the partition diameter, the number of alternative optima, and the partition agreement among the alternative optima. Using five examples from the empirical literature, we show the practical value of the proposed process for facilitating model selection for the MDPP.
Collapse
|
27
|
Steinley D, Brusco MJ, Henson R. Principal Cluster Axes: A Projection Pursuit Index for the Preservation of Cluster Structures in the Presence of Data Reduction. Multivariate Behav Res 2012; 47:463-92. [PMID: 26814606 PMCID: PMC5982590 DOI: 10.1080/00273171.2012.673952] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
A measure of "clusterability" serves as the basis of a new methodology designed to preserve cluster structure in a reduced dimensional space. Similar to principal component analysis, which finds the direction of maximal variance in multivariate space, principal cluster axes find the direction of maximum clusterability in multivariate space. Furthermore, the principal clustering approach falls into the class of projection pursuit techniques. Comparisons are made with existing methodologies both in a simulation study and analysis of real-world data sets. Furthermore, a demonstration of how to interpret the results of the principal cluster axes is provided on the analysis of Supreme Court voting data and similarities between the interpretation of competing procedures (e.g., factor analysis and principal component analysis) are provided. In addition to the Supreme Court analysis, we analyze several data sets often used to test cluster analysis procedures, including Fisher's Iris data, Agresti's Crab data, and a data set on glass fragments. Finally, discussion is provided to help determine when the proposed procedure will be the most beneficial to the researcher.
Collapse
|
28
|
Abstract
There are a number of important problems in quantitative psychology that require the identification of a permutation of the n rows and columns of an n × n proximity matrix. These problems encompass applications such as unidimensional scaling, paired-comparison ranking, and anti-Robinson forms. The importance of simultaneously incorporating multiple objective criteria in matrix permutation applications is well recognized in the literature; however, to date, there has been a reliance on weighted-sum approaches that transform the multiobjective problem into a single-objective optimization problem. Although exact solutions to these single-objective problems produce supported Pareto efficient solutions to the multiobjective problem, many interesting unsupported Pareto efficient solutions may be missed. We illustrate the limitation of the weighted-sum approach with an example from the psychological literature and devise an effective heuristic algorithm for estimating both the supported and unsupported solutions of the Pareto efficient set.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, Florida State University, Florida 32306–1110, USA.
| | | |
Collapse
|
29
|
Abstract
Steinley (2007) provided a lower bound for the sum-of-squares error criterion function used in K-means clustering. In this article, on the basis of the lower bound, the authors propose a method to distinguish between 1 cluster (i.e., a single distribution) versus more than 1 cluster. Additionally, conditional on indicating there are multiple clusters, the procedure is extended to determine the number of clusters. Through a series of simulations, the proposed methodology is shown to outperform several other commonly used procedures for determining both the presence of clusters and their number.
Collapse
Affiliation(s)
- Douglas Steinley
- Department of Psychological Sciences, University of Missouri, Columbia, MO 65203, USA.
| | | |
Collapse
|
30
|
|
31
|
|
32
|
|
33
|
|
34
|
Abstract
The p-median clustering model represents a combinatorial approach to partition data sets into disjoint, nonhierarchical groups. Object classes are constructed around exemplars, that is, manifest objects in the data set, with the remaining instances assigned to their closest cluster centers. Effective, state-of-the-art implementations of p-median clustering are virtually unavailable in the popular social and behavioral science statistical software packages. We present p-median clustering, including a detailed description of its mechanics and a discussion of available software programs and their capabilities. Application to a complex structured data set on the perception of food items illustrates p-median clustering.
Collapse
Affiliation(s)
- Hans-Friedrich Köhn
- Department of Psychological Sciences, University of Missouri, Columbia, MO 65211-2500, USA.
| | | | | |
Collapse
|
35
|
Brusco MJ, Steinley D. Integer Programs for One- and Two-Mode Blockmodeling Based on Prespecified Image Matrices for Structural and Regular Equivalence. J Math Psychol 2009; 53:577-585. [PMID: 29769750 PMCID: PMC5951192 DOI: 10.1016/j.jmp.2009.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Establishing blockmodels for one- and two-mode binary network matrices has typically been accomplished using multiple restarts of heuristic algorithms that minimize functions of inconsistency with ideal block structure. Although these algorithms likely yield exceptional performance, they are not assured to provide blockmodels that optimize the functional indices. In this paper, we present integer programming models that, for a prespecified image matrix, can produce guaranteed optimal solutions for matrices of nontrivial size. Accordingly, analysts performing a confirmatory analysis of a prespecified blockmodel structure can apply our models directly to obtain an optimal solution. In exploratory cases where a blockmodel structure is not prespecified, we recommend a two-stage procedure, where a heuristic method is first used to identify an image matrix and the integer program is subsequently formulated and solved to identify the optimal solution for that image matrix. Although best suited for ideal block structures associated with structural equivalence, the integer programming models have the flexibility to accommodate functional indices pertaining to regular equivalence. Computational results are reported for a variety of one- and two-mode matrices from the blockmodeling literature.
Collapse
Affiliation(s)
- Michael J. Brusco
- Please address all correspondence to: Michael J. Brusco Department of Marketing College of Business Florida State University Tallahassee, FL 32306-1110
| | | |
Collapse
|
36
|
Brusco MJ, Steinley D, Cradit JD. An Exact Algorithm for Hierarchically Well-Formulated Subsets in Second-Order Polynomial Regression. Technometrics 2009. [DOI: 10.1198/tech.2009.08022] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
37
|
Abstract
The implementation of multiobjective programming methods in combinatorial data analysis is an emergent area of study with a variety of pragmatic applications in the behavioural sciences. Most notably, multiobjective programming provides a tool for analysts to model trade offs among competing criteria in clustering, seriation, and unidimensional scaling tasks. Although multiobjective programming has considerable promise, the technique can produce numerically appealing results that lack empirical validity. With this issue in mind, the purpose of this paper is to briefly review viable areas of application for multiobjective programming and, more importantly, to outline the importance of cross-validation when using this method in cluster analysis.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, Florida State University, Florida, USA.
| | | |
Collapse
|
38
|
Abstract
The kappa coefficient is one of the most widely used measures for evaluating the agreement between two raters asked to assign N objects to one of K nominal categories. Weighted versions of kappa enable partial credit to be awarded for near agreement, most notably in the case of ordinal categories. An exact significance test for weighted kappa can be conducted by enumerating all rater agreement tables with the same fixed marginal frequencies as the observed table, and accumulating the probabilities for all tables that produce a weighted kappa index that is greater than or equal to the observed measure. Unfortunately, complete enumeration of all tables is computationally unwieldy for modest values of N and K. We present an implicit enumeration algorithm for conducting an exact test of weighted kappa, which can be applied to tables of non-trivial size. The algorithm is particularly efficient for 'good' to 'excellent' values of weighted kappa that typically have very small p-values. Therefore, our method is beneficial for situations where resampling tests are of limited value because the number of trials needed to estimate the p-value tends to be large.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee, FL 32306-110, USA.
| | | | | |
Collapse
|
39
|
Affiliation(s)
- Michael J. Brusco
- a Department of Marketing, College of Business , Florida State University , Tallahassee , FL , 32306-1110 , USA
| |
Collapse
|
40
|
Abstract
Frey and Dueck (Reports, 16 February 2007, p. 972) described an algorithm termed "affinity propagation" (AP) as a promising alternative to traditional data clustering procedures. We demonstrate that a well-established heuristic for the p-median problem often obtains clustering solutions with lower error than AP and produces these solutions in comparable computation time.
Collapse
Affiliation(s)
- Michael J Brusco
- College of Business, 307 Rovetta Business Building, Florida State University, Tallahassee, FL 32306-1110, USA.
| | | |
Collapse
|
41
|
Abstract
Clusterwise linear regression is a multivariate statistical procedure that attempts to cluster objects with the objective of minimizing the sum of the error sums of squares for the within-cluster regression models. In this article, we show that the minimization of this criterion makes no effort to distinguish the error explained by the within-cluster regression models from the error explained by the clustering process. In some cases, most of the variation in the response variable is explained by clustering the objects, with little additional benefit provided by the within-cluster regression models. Accordingly, there is tremendous potential for overfitting with clusterwise regression, which is demonstrated with numerical examples and simulation experiments. To guard against the misuse of clusterwise regression, we recommend a benchmarking procedure that compares the results for the observed empirical data with those obtained across a set of random permutations of the response measures. We also demonstrate the potential for overfitting via an empirical application related to the prediction of reflective judgment using high school and college performance measures.
Collapse
|
42
|
Abstract
A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique. The performances of these procedures are demonstrated in a simulation study, showing favorable results when compared with existing standardization methods. A detailed demonstration of the weighting and selection procedure is provided for the well-known Fisher Iris data and several synthetic data sets.
Collapse
|
43
|
Brusco MJ, Stahl S. An algorithm for extracting maximum cardinality subsets with perfect dominance or anti-Robinson structures. Br J Math Stat Psychol 2007; 60:377-393. [PMID: 17971276 DOI: 10.1348/000711006x107872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
A common criterion for seriation of asymmetric matrices is the maximization of the dominance index, which sums the elements above the main diagonal of a reordered matrix. Similarly, a popular seriation criterion for symmetric matrices is the maximization of an anti-Robinson gradient index, which is associated with the patterning of elements in the rows and columns of a reordered matrix. Although perfect dominance and perfect anti-Robinson structure are rarely achievable for empirical matrices, we can often identify a sizable subset of objects for which a perfect structure is realized. We present and demonstrate an algorithm for obtaining a maximum cardinality (i.e. the largest number of objects) subset of objects such that the seriation of the proximity matrix corresponding to the subset will have perfect structure. MATLAB implementations of the algorithm are available for dominance, anti-Robinson and strongly anti-Robinson structures.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee, FL 32306, USA.
| | | |
Collapse
|
44
|
Abstract
The study of confusion data is a well established practice in psychology. Although many types of analytical approaches for confusion data are available, among the most common methods are the extraction of 1 or more subsets of stimuli, the partitioning of the complete stimulus set into distinct groups, and the ordering of the stimulus set. Although standard commercial software packages can sometimes facilitate these types of analyses, they are not guaranteed to produce optimal solutions. The authors present a MATLAB *.m file for preprocessing confusion matrices, which includes fitting of the similarity-choice model. Two additional MATLAB programs are available for optimally clustering stimuli on the basis of confusion data. The authors also developed programs for optimally ordering stimuli and extracting subsets of stimuli using information from confusion matrices. Together, these programs provide several pragmatic alternatives for the applied researcher when analyzing confusion data. Although the programs are described within the context of confusion data, they are also amenable to other types of proximity data.
Collapse
Affiliation(s)
- Michael J Brusco
- Marketing Department, College of Business, Florida State University, Tallahassee, FL 32306-1110, USA.
| | | |
Collapse
|
45
|
Brusco MJ. A Repetitive Branch-and-Bound Procedure for Minimum Within-Cluster Sums of Squares Partitioning. Psychometrika 2006; 71:347-363. [PMID: 28197962 DOI: 10.1007/s11336-004-1218-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/04/2005] [Accepted: 06/09/2006] [Indexed: 06/06/2023]
Abstract
Minimization of the within-cluster sums of squares (WCSS) is one of the most important optimization criteria in cluster analysis. Although cluster analysis modules in commercial software packages typically use heuristic methods for this criterion, optimal approaches can be computationally feasible for problems of modest size. This paper presents a new branch-and-bound algorithm for minimizing WCSS. Algorithmic enhancements include an effective reordering of objects and a repetitive solution approach that precludes the need for splitting the data set, while maintaining strong bounds throughout the solution process. The new algorithm provided optimal solutions for problems with up to 240 objects and eight well-separated clusters. Poorly separated problems with no inherent cluster structure were optimally solved for up to 60 objects and six clusters. The repetitive branch-and-bound algorithm was also successfully applied to three empirical data sets from the classification literature.
Collapse
Affiliation(s)
- Michael J Brusco
- Florida State University, USA.
- Marketing Department, College of Business, Florida State University, Tallahassee, FL, 32306-1110.
| |
Collapse
|
46
|
Abstract
The decomposition of an asymmetric proximity matrix into its symmetric and skew-symmetric components is a well-known principle in combinatorial data analysis. The seriation of the skew-symmetric component can emphasize information corresponding to the sign or absolute magnitude of the matrix elements, and the choice of objective criterion can have a profound impact on the ordering. In this research note, we propose a bicriterion approach for seriation of a skew-symmetric matrix incorporating both sign and magnitude information. Two numerical demonstrations reveal that the bicriterion procedure is an effective alternative to direct seriation of the skew-symmetric matrix, facilitating favourable trade-offs among sign and magnitude information.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee 32306-1110, USA.
| | | |
Collapse
|
47
|
Abstract
Partitioning indices associated with the within-cluster sums of pairwise dissimilarities often exhibit a systematic bias towards clusters of a particular size, whereas minimization of the partition diameter (i.e. the maximum dissimilarity element across all pairs of objects within the same cluster) does not typically have this problem. However, when the partition-diameter criterion is used, there is often a myriad of alternative optimal solutions that can vary significantly with respect to their substantive interpretation. We propose a bicriterion partitioning approach that considers both diameter and within-cluster sums in the optimization problem and facilitates selection from among the alternative optima. We developed several MATLAB-based exchange algorithms that rapidly provide excellent solutions to bicriterion partitioning problems. These algorithms were evaluated using synthetic data sets, as well as an empirical dissimilarity matrix.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee 32306-1110, USA.
| | | |
Collapse
|
48
|
Abstract
In this article, we examine the concordance among 19 empirical confusion matrices for visual and tactual recognition of capital letters of the alphabet. As a measure of concordance, we employed an index based on within-stimulus triads of letters. Unlike correlation measures of agreement that are based on a one-to-one matching of matrix elements, the selected index directly captures the internal structures of the confusion matrices prior to the comparison. Permutation tests revealed statistically significant concordance among 166 of 171 pairs of matrices in the study. Concordance of confusion structure among tactual matrices tended to be somewhat stronger than concordance among the visual matrices.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of business, Florida State University, Tallahassee, Florida 32306-1110, USA.
| |
Collapse
|
49
|
Abstract
A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the true cluster structure. The author presents a heuristic procedure that selects an appropriate subset from among the set of all candidate clustering variables. Specifically, this procedure attempts to select only those variables that contribute to the definition of true cluster structure while eliminating variables that can hide (or mask) that true structure. Experimental testing of the proposed variable-selection procedure reveals that it is extremely successful at accomplishing this goal.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, College of Business, Florida State University, Tallahassee, FL 32306, USA.
| |
Collapse
|
50
|
Abstract
This paper focuses on the problem of developing a partition of n objects based on the information in a symmetric, non-negative dissimilarity matrix. The goal is to partition the objects into a set of non-overlapping subsets with the objective of minimizing the sum of the within-subset dissimilarities. Optimal solutions to this problem can be obtained using dynamic programming, branch-and-bound and other mathematical programming methods. An improved branch-and-bound algorithm is shown to be particularly efficient. The improvements include better upper bounds that are obtained via a fast exchange algorithm and, more importantly, sharper lower bounds obtained through sequential solution of submatrices. A modified version of the branch-and-bound algorithm for minimizing the diameter of a partition is also presented. Computational results for both synthetic and empirical dissimilarity matrices reveal the effectiveness of the branch-and-bound methodology.
Collapse
Affiliation(s)
- Michael J Brusco
- Department of Marketing, Florida State University, Tallahassee, FL 32306, USA.
| |
Collapse
|