26
Trogdon JG, Weir WH, Shai S, Mucha PJ, Kuo TM, Meyer AM, Stitzenberg KB. Comparing Shared Patient Networks Across Payers. J Gen Intern Med 2019; 34:2014-2020. [PMID: 30945065 PMCID: PMC6816773 DOI: 10.1007/s11606-019-04978-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/11/2018] [Revised: 11/21/2018] [Accepted: 02/19/2019] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Measuring care coordination in administrative data facilitates important research to improve care quality.
OBJECTIVE: To compare shared patient networks constructed from administrative claims data across multiple payers.
DESIGN: Social network analysis of pooled cross sections of physicians treating prevalent colorectal cancer patients between 2003 and 2013.
PARTICIPANTS: Surgeons, medical oncologists, and radiation oncologists identified from North Carolina Central Cancer Registry data linked to Medicare claims (N = 1735) and private insurance claims (N = 1321).
MAIN MEASURES: Provider-level measures included the number of patients treated, the number of providers with whom they share patients (by specialty), the extent of patient sharing with each specialty, and network centrality. Network-level measures included the number of providers and shared patients, the density of shared-patient relationships among providers, and the size and composition of clusters of providers with a high level of patient sharing.
RESULTS: For 24.5% of providers, total patient volume rank differed by at least one quintile group between payers. Medicare claims missed 14.6% of all shared patient relationships between providers, but captured a greater number of patient-sharing relationships per provider compared with the private insurance database, even after controlling for the total number of patients (27.242 vs 26.044, p < 0.001). Providers in the private network shared a higher fraction of patients with other providers (0.226 vs 0.127, p < 0.001) compared to the Medicare network. Clustering coefficients for providers, weighted betweenness, and eigenvector centrality varied greatly across payers. Network differences led to some clusters of providers that existed in the combined network not being detected in Medicare alone.
CONCLUSION: Many features of shared patient networks constructed from a single-payer database differed from similar networks constructed from other payers' data. Depending on a study's goals, shortcomings of single-payer networks should be considered when using claims data to draw conclusions about provider behavior.
27
Barnett I, Malik N, Kuijjer ML, Mucha PJ, Onnela JP. EndNote: Feature-based classification of networks. Network Science (Cambridge University Press) 2019; 7:438-444. [PMID: 31984135 PMCID: PMC6980283 DOI: 10.1017/nws.2019.21] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Network representations of systems from various scientific and societal domains are neither completely random nor fully regular, but instead appear to contain recurring structural features. These features tend to be shared by networks belonging to the same broad class, such as the class of social networks or the class of biological networks. Within each such class, networks describing similar systems tend to have similar features. This occurs presumably because networks representing similar systems would be expected to be generated by a shared set of domain-specific mechanisms, and it should therefore be possible to classify networks based on their features at various structural levels. Here we describe and demonstrate a new hybrid approach that combines manual selection of network features of potential interest with existing automated classification methods. In particular, by selecting well-known network features that have been studied extensively in the social network analysis and network science literature, and then classifying networks on the basis of these features using methods such as random forest, which is known to handle the type of feature collinearity that arises in this setting, we find that our approach is able to achieve both higher accuracy and greater interpretability in shorter computation time than other methods.
28
Robinson JI, Weir WH, Crowley JR, Hink T, Reske KA, Kwon JH, Burnham CAD, Dubberke ER, Mucha PJ, Henderson JP. Metabolomic networks connect host-microbiome processes to human Clostridioides difficile infections. J Clin Invest 2019; 129:3792-3806. [PMID: 31403473 DOI: 10.1172/jci126905] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 01/02/2019] [Accepted: 06/11/2019] [Indexed: 12/15/2022]
Abstract
Clostridioides difficile infection (CDI) accounts for a substantial proportion of deaths attributable to antibiotic-resistant bacteria in the United States. Although C. difficile can be an asymptomatic colonizer, its pathogenic potential is most commonly manifested in patients with antibiotic-modified intestinal microbiomes. In a cohort of 186 hospitalized patients, we showed that host and microbe-associated shifts in fecal metabolomes had the potential to distinguish patients with CDI from those with non-C. difficile diarrhea and C. difficile colonization. Patients with CDI exhibited a chemical signature of Stickland amino acid fermentation that was distinct from those of uncolonized controls. This signature suggested that C. difficile preferentially catabolizes branched chain amino acids during CDI. Unexpectedly, we also identified a series of noncanonical, unsaturated bile acids that were depleted in patients with CDI. These bile acids may derive from an extended host-microbiome dehydroxylation network in uninfected patients. Bile acid composition and leucine fermentation defined a prototype metabolomic model with potential to distinguish clinical CDI from asymptomatic C. difficile colonization.
29
Lee HW, Malik N, Shi F, Mucha PJ. Social clustering in epidemic spread on coevolving networks. Phys Rev E 2019; 99:062301. [PMID: 31330685 PMCID: PMC6790070 DOI: 10.1103/physreve.99.062301] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/20/2017] [Indexed: 11/29/2022]
Abstract
Even though transitivity is a central structural feature of social networks, its influence on epidemic spread on coevolving networks has remained relatively unexplored. Here we introduce and study an adaptive susceptible-infected-susceptible (SIS) epidemic model wherein the infection and network coevolve with nontrivial probability to close triangles during edge rewiring, leading to substantial reinforcement of network transitivity. This model provides an opportunity to study the role of transitivity in altering the SIS dynamics on a coevolving network. Using numerical simulations and approximate master equations (AMEs), we identify and examine a rich set of dynamical features in the model. In many cases, AMEs including transitivity reinforcement provide accurate predictions of stationary-state disease prevalence and network degree distributions. Furthermore, for some parameter settings, the AMEs accurately trace the temporal evolution of the system. We show that higher transitivity reinforcement in the model leads to lower levels of infective individuals in the population, when closing a triangle is the dominant rewiring mechanism. These methods and results may be useful in developing ideas and modeling strategies for controlling SIS-type epidemics.
30
Gates KM, Fisher ZF, Arizmendi C, Henry TR, Duffy KA, Mucha PJ. Assessing the robustness of cluster solutions obtained from sparse count matrices. Psychol Methods 2019; 24:675-689. [PMID: 30742473 DOI: 10.1037/met0000204] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/11/2022]
Abstract
Psychological researchers often seek to obtain cluster solutions from sparse count matrices (e.g., social networks; counts of symptoms that are in common for 2 given individuals; structural brain imaging). Increasingly, community detection methods are being used to subset the data in a data-driven manner. While many of these approaches perform well in simulation studies and thus offer some improvement upon traditional clustering approaches, there is no readily available approach for evaluating the robustness of these solutions in empirical data. Researchers have no way of knowing if their results are due to noise. We describe here 2 approaches novel to the field of psychology that enable evaluation of cluster solution robustness. This tutorial also explains the use of an associated R package, perturbR, which provides researchers with the ability to use the methods described herein. In the first approach, the cluster assignment from the original matrix is compared against cluster assignments obtained by randomly perturbing the edges in the matrix. Stable cluster solutions should not demonstrate large changes in the presence of small perturbations. For the second approach, Monte Carlo simulations of random matrices that have the same properties as the original matrix are generated. The distribution of quality scores ("modularity") obtained from the cluster solutions from these matrices is then compared with the score obtained from the original matrix results. From this, one can assess if the results are better than what would be expected by chance. perturbR automates these 2 methods, providing an easy-to-use resource for psychological researchers. We demonstrate the utility of this package using benchmark simulated data generated from a previous study and then apply the methods to publicly available empirical data obtained from social networks and structural neuroimaging.
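The first approach in this abstract (perturb edges, re-cluster, compare assignments) can be illustrated with a minimal Python sketch. It is not the perturbR implementation: the function names are hypothetical, connected components stand in for a real community detection method, and the plain Rand index stands in for whatever comparison measure the package uses.

```python
import itertools
import random


def perturb_edges(edges, n_nodes, fraction, rng):
    """Move a fraction of the edges to randomly chosen unoccupied node pairs."""
    edge_set = {tuple(sorted(e)) for e in edges}
    target = len(edge_set)
    n_move = round(fraction * target)
    edge_set.difference_update(rng.sample(sorted(edge_set), n_move))
    while len(edge_set) < target:
        u, v = rng.sample(range(n_nodes), 2)
        edge_set.add((min(u, v), max(u, v)))
    return sorted(edge_set)


def component_labels(edges, n_nodes):
    """Toy stand-in for a clustering method: label nodes by connected component."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(x) for x in range(n_nodes)]


def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two cluster assignments agree."""
    pairs = list(itertools.combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Two disjoint triangles: the original "cluster solution" has two groups.
rng = random.Random(0)
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
original = component_labels(two_triangles, 6)
scores = []
for _ in range(20):
    noisy = perturb_edges(two_triangles, 6, fraction=0.2, rng=rng)
    scores.append(rand_index(original, component_labels(noisy, 6)))
mean_score = sum(scores) / len(scores)
```

A mean score near 1 under small perturbations would suggest a stable solution; this toy clustering is deliberately fragile, so a single moved edge can merge the two groups.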
31
Granell C, Mucha PJ. Epidemic spreading in localized environments with recurrent mobility patterns. Phys Rev E 2018; 97:052302. [PMID: 29906863 PMCID: PMC6195814 DOI: 10.1103/physreve.97.052302] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 12/21/2017] [Indexed: 11/08/2022]
Abstract
The spreading of epidemics is very much determined by the structure of the contact network, which may be impacted by the mobility dynamics of the individuals themselves. In confined scenarios where a small, closed population spends most of its time in localized environments and has easily identifiable mobility patterns (such as workplaces, university campuses, or schools), it is of critical importance to identify the factors controlling the rate of disease spread. Here, we present a discrete-time, metapopulation-based model to describe the transmission of susceptible-infected-susceptible-like diseases that take place in confined scenarios where the mobilities of the individuals are not random but, rather, follow clear recurrent travel patterns. This model allows analytical determination of the onset of epidemics, as well as the ability to discern which contact structures are most suited to prevent the infection from spreading. It thereby determines whether common prevention mechanisms, such as isolation, are worth implementing in such a scenario and what their expected impact would be.
32
Kirk JM, Kim SO, Inoue K, Smola MJ, Lee DM, Schertzer MD, Wooten JS, Baker AR, Sprague D, Collins DW, Horning CR, Wang S, Chen Q, Weeks KM, Mucha PJ, Calabrese JM. Functional classification of long non-coding RNAs by k-mer content. Nat Genet 2018; 50:1474-1482. [PMID: 30224646 PMCID: PMC6262761 DOI: 10.1038/s41588-018-0207-8] [Citation(s) in RCA: 121] [Impact Index Per Article: 20.2] [Received: 07/24/2017] [Accepted: 07/24/2018] [Indexed: 12/30/2022]
Abstract
The functions of most long non-coding RNAs (lncRNAs) are unknown. In contrast to proteins, lncRNAs with similar functions often lack linear sequence homology; thus, the identification of function in one lncRNA rarely informs the identification of function in others. We developed a sequence comparison method to deconstruct linear sequence relationships in lncRNAs and evaluate similarity based on the abundance of short motifs called kmers. We found that lncRNAs of related function often had similar kmer profiles despite lacking linear homology, and that kmer profiles correlated with protein binding to lncRNAs and with their subcellular localization. Using a novel assay to quantify Xist-like regulatory potential, we directly demonstrated that evolutionarily unrelated lncRNAs can encode similar function through different spatial arrangements of related sequence motifs. Kmer-based classification is a powerful approach to detect recurrent relationships between sequence and function in lncRNAs.
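The idea of comparing sequences by k-mer abundance rather than by linear alignment can be sketched in Python as follows. This is a simplified illustration, not the authors' method: the function names and toy sequences are hypothetical, counts are only length-normalized, and any standardization against a reference set of transcripts is omitted.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k):
    """Length-normalized counts of every possible k-mer over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    n_windows = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases
            counts[word] += 1
    return [counts[w] / n_windows for w in kmers]


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Two toy sequences that are offset relative to each other but share motifs.
seq_a = "ACGTACGTACGT"
seq_b = "GTACGTACGTAC"
profile_a = kmer_profile(seq_a, k=2)
profile_b = kmer_profile(seq_b, k=2)
similarity = pearson(profile_a, profile_b)
```

Because the profile discards positional information, sequences built from the same motifs in a different arrangement still score as similar, which is the property the abstract relies on.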
33
Heroy S, Taylor D, Shi FB, Forest MG, Mucha PJ. Rigid graph compression: Motif-based rigidity analysis for disordered fiber networks. Multiscale Modeling & Simulation: A SIAM Interdisciplinary Journal 2018; 16:1283-1304. [PMID: 30450018 PMCID: PMC6234004 DOI: 10.1137/17m1157271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Using particle-scale models to accurately describe property enhancements and phase transitions in macroscopic behavior is a major engineering challenge in composite materials science. To address some of these challenges, we use the graph theoretic property of rigidity to model mechanical reinforcement in composites with stiff rod-like particles. We develop an efficient algorithmic approach called rigid graph compression (RGC) to describe the transition from floppy to rigid in disordered fiber networks ("rod-hinge systems"), which form the reinforcing phase in many composite systems. To establish RGC on a firm theoretical foundation, we adapt rigidity matroid theory to identify primitive topological network motifs that serve as rules for composing interacting rigid particles into larger rigid components. This approach is computationally efficient and stable, because RGC requires only topological information about rod interactions (encoded by a sparse unweighted network) rather than geometrical details such as rod locations or pairwise distances (as required in rigidity matroid theory). We conduct numerical experiments on simulated two-dimensional rod-hinge systems to demonstrate that RGC closely approximates the rigidity percolation threshold for such systems, through comparison with the pebble game algorithm (which is exact in two dimensions). Importantly, whereas the pebble game is derived from Laman's condition and is only valid in two dimensions, the RGC approach naturally extends to higher dimensions.
34
Li Z, Mucha PJ, Taylor D. Network-ensemble comparisons with stochastic rewiring and von Neumann entropy. SIAM Journal on Applied Mathematics 2018; 78:897-920. [PMID: 30319156 PMCID: PMC6181241 DOI: 10.1137/17m1124218] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
Assessing whether a given network is typical or atypical for a random-network ensemble (i.e., network-ensemble comparison) has widespread applications ranging from null-model selection and hypothesis testing to clustering and classifying networks. We develop a framework for network-ensemble comparison by subjecting the network to stochastic rewiring. We study two rewiring processes, uniform and degree-preserved rewiring, which yield random-network ensembles that converge to the Erdős-Rényi and configuration-model ensembles, respectively. We study convergence through von Neumann entropy (VNE), a network summary statistic measuring information content based on the spectra of a Laplacian matrix, and develop a perturbation analysis for the expected effect of rewiring on VNE. Our analysis yields an estimate for how many rewires are required for a given network to resemble a typical network from an ensemble, offering a computationally efficient quantity for network-ensemble comparison that does not require simulation of the corresponding rewiring process.
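Degree-preserved rewiring is commonly implemented as a double-edge swap, and the Python sketch below assumes that reading; the abstract does not spell out the exact rule, and the function names are hypothetical. Two edges (a, b) and (c, d) are replaced by (a, d) and (c, b), so every node keeps its degree while the wiring is randomized.

```python
import random
from collections import Counter


def degree_counts(edges):
    """Degree of every node that appears in the edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_preserving_rewire(edges, n_swaps, rng):
    """Apply up to n_swaps double-edge swaps to a simple undirected graph."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:  # choose one of the two possible pairings
            c, d = d, c
        if a == d or c == b:  # the swap would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # would create a multi-edge
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Rewire a 10-node ring; every node should still have degree 2 afterwards.
ring = [(k, (k + 1) % 10) for k in range(10)]
rewired = degree_preserving_rewire(ring, n_swaps=20, rng=random.Random(1))
```

Tracking a summary statistic such as VNE as a function of the number of swaps would then show how quickly the network approaches the corresponding ensemble; that step is not shown here.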
35
Lee HW, Malik N, Mucha PJ. Evolutionary prisoner's dilemma games coevolving on adaptive networks. Journal of Complex Networks 2018; 6:1-23. [PMID: 29732158 PMCID: PMC5931405 DOI: 10.1093/comnet/cnx018] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/08/2023]
Abstract
We study a model for switching strategies in the Prisoner's Dilemma game on adaptive networks of player pairings that coevolve as players attempt to maximize their return. We use a node-based strategy model wherein each player follows one strategy at a time (cooperate or defect) across all of its neighbors, changing that strategy and possibly changing partners in response to local changes in the network of player pairing and in the strategies used by connected partners. We compare and contrast numerical simulations with existing pair approximation differential equations for describing this system, as well as more accurate equations developed here using the framework of approximate master equations. We explore the parameter space of the model, demonstrating the relatively high accuracy of the approximate master equations for describing the system observations made from simulations. We study two variations of this partner-switching model to investigate the system evolution, predict stationary states, and compare the total utilities and other qualitative differences between these two model variants.
36
Strano E, Giometto A, Shai S, Bertuzzo E, Mucha PJ, Rinaldo A. The scaling structure of the global road network. Royal Society Open Science 2017; 4:170590. [PMID: 29134071 PMCID: PMC5666254 DOI: 10.1098/rsos.170590] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 06/01/2017] [Accepted: 09/18/2017] [Indexed: 05/30/2023]
Abstract
Because of increasing global urbanization and its immediate consequences, including changes in patterns of food demand, circulation and land use, the next century will witness a major increase in the extent of paved roads built worldwide. To model the effects of this increase, it is crucial to understand whether possible self-organized patterns are inherent in the global road network structure. Here, we use the largest updated database comprising all major roads on the Earth, together with global urban and cropland inventories, to suggest that road length distributions within croplands are indistinguishable from urban ones, once rescaled to account for the difference in mean road length. Such similarity extends to road length distributions within urban or agricultural domains of a given area. We find two distinct regimes for the scaling of the mean road length with the associated area, holding in general at small and at large values of the latter. In suitably large urban and cropland domains, we find that mean and total road lengths increase linearly with their domain area, in contrast to earlier suggestions. Scaling regimes suggest that simple and universal mechanisms regulate urban and cropland road expansion at the global scale. As such, our findings bear implications for global road infrastructure growth based on land-use change and for planning policies sustaining urban expansions.
37
Weir WH, Emmons S, Gibson R, Taylor D, Mucha PJ. Post-Processing Partitions to Identify Domains of Modularity Optimization. Algorithms 2017; 10. [PMID: 29046743 PMCID: PMC5642987 DOI: 10.3390/a10030093] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]
Abstract
We introduce the Convex Hull of Admissible Modularity Partitions (CHAMP) algorithm to prune and prioritize different network community structures identified across multiple runs of possibly various computational heuristics. Given a set of partitions, CHAMP identifies the domain of modularity optimization for each partition—i.e., the parameter-space domain where it has the largest modularity relative to the input set—discarding partitions with empty domains to obtain the subset of partitions that are “admissible” candidate community structures that remain potentially optimal over indicated parameter domains. Importantly, CHAMP can be used for multi-dimensional parameter spaces, such as those for multilayer networks where one includes a resolution parameter and interlayer coupling. Using the results from CHAMP, a user can more appropriately select robust community structures by observing the sizes of domains of optimization and the pairwise comparisons between partitions in the admissible subset. We demonstrate the utility of CHAMP with several example networks. In these examples, CHAMP focuses attention onto pruned subsets of admissible partitions that are 20-to-1785 times smaller than the sets of unique partitions obtained by community detection heuristics that were input into CHAMP.
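For a single resolution parameter, the domain idea in this abstract can be sketched in Python under one stated assumption: the modularity of a fixed partition k is linear in the resolution parameter, Q_k(gamma) = a_k - gamma * p_k. The brute-force envelope search below is a stand-in for illustration, not the CHAMP implementation; the function name and the coefficient values are hypothetical.

```python
def admissible_domains(coeffs, gamma_min, gamma_max):
    """Return (partition index, lower gamma, upper gamma) for each winning partition.

    coeffs[k] = (a_k, p_k), so that Q_k(gamma) = a_k - gamma * p_k.
    Partitions that never attain the maximum do not appear in the output.
    """
    # Candidate breakpoints: every pairwise crossing of two lines in range.
    cuts = {gamma_min, gamma_max}
    for i, (a_i, p_i) in enumerate(coeffs):
        for a_j, p_j in coeffs[i + 1:]:
            if p_i != p_j:
                gamma = (a_i - a_j) / (p_i - p_j)
                if gamma_min < gamma < gamma_max:
                    cuts.add(gamma)
    cuts = sorted(cuts)

    # Between consecutive breakpoints the ordering of the lines is fixed,
    # so the winner at the midpoint wins on the whole interval.
    domains = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = 0.5 * (lo + hi)
        best = max(range(len(coeffs)),
                   key=lambda k: coeffs[k][0] - mid * coeffs[k][1])
        if domains and domains[-1][0] == best:
            domains[-1] = (best, domains[-1][1], hi)  # extend previous domain
        else:
            domains.append((best, lo, hi))
    return domains


# Four hypothetical partitions; the last is dominated everywhere by partition 1.
coeffs = [(1.0, 1.0), (0.8, 0.5), (0.5, 0.3), (0.6, 0.5)]
domains = admissible_domains(coeffs, gamma_min=0.0, gamma_max=2.0)
winners = [k for k, _, _ in domains]
```

In this toy input, partition 3 has an empty domain and is pruned, while the widths of the remaining domains indicate how robust each surviving partition is to the choice of gamma.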
38
Taylor D, Caceres RS, Mucha PJ. Super-Resolution Community Detection for Layer-Aggregated Multilayer Networks. Physical Review X 2017; 7:031056. [PMID: 29445565 PMCID: PMC5809009 DOI: 10.1103/physrevx.7.031056] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/08/2023]
Abstract
Applied network science often involves preprocessing network data before applying a network-analysis method, and there is typically a theoretical disconnect between these steps. For example, it is common to aggregate time-varying network data into windows prior to analysis, and the trade-offs of this preprocessing are not well understood. Focusing on the problem of detecting small communities in multilayer networks, we study the effects of layer aggregation by developing random-matrix theory for modularity matrices associated with layer-aggregated networks with N nodes and L layers, which are drawn from an ensemble of Erdős-Rényi networks with communities planted in subsets of layers. We study phase transitions in which eigenvectors localize onto communities (allowing their detection) and which occur for a given community provided its size surpasses a detectability limit K*. When layers are aggregated via a summation, we obtain [Formula: see text], where T is the number of layers across which the community persists. Interestingly, if T is allowed to vary with L, then summation-based layer aggregation enhances small-community detection even if the community persists across a vanishing fraction of layers, provided that T/L decays more slowly than O(L^{-1/2}). Moreover, we find that thresholding the summation can, in some cases, cause K* to decay exponentially, decreasing by orders of magnitude in a phenomenon we call super-resolution community detection. In other words, layer aggregation with thresholding is a nonlinear data filter enabling detection of communities that are otherwise too small to detect. Importantly, different thresholds generally enhance the detectability of communities having different properties, illustrating that community detection can be obscured if one analyzes network data using a single threshold.
39
Taylor D, Myers SA, Clauset A, Porter MA, Mucha PJ. Eigenvector-based centrality measures for temporal networks. Multiscale Modeling & Simulation: A SIAM Interdisciplinary Journal 2017; 15:537-574. [PMID: 29046619 PMCID: PMC5643020 DOI: 10.1137/16m1066142] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Indexed: 05/30/2023]
Abstract
Numerous centrality measures have been developed to quantify the importances of nodes in time-independent networks, and many of them can be expressed as the leading eigenvector of some matrix. With the increasing availability of network data that changes in time, it is important to extend such eigenvector-based centrality measures to time-dependent networks. In this paper, we introduce a principled generalization of network centrality measures that is valid for any eigenvector-based centrality. We consider a temporal network with N nodes as a sequence of T layers that describe the network during different time windows, and we couple centrality matrices for the layers into a supra-centrality matrix of size NT × NT whose dominant eigenvector gives the centrality of each node i at each time t. We refer to this eigenvector and its components as a joint centrality, as it reflects the importances of both the node i and the time layer t. We also introduce the concepts of marginal and conditional centralities, which facilitate the study of centrality trajectories over time. We find that the strength of coupling between layers is important for determining multiscale properties of centrality, such as localization phenomena and the time scale of centrality changes. In the strong-coupling regime, we derive expressions for time-averaged centralities, which are given by the zeroth-order terms of a singular perturbation expansion. We also study first-order terms to obtain first-order-mover scores, which concisely describe the magnitude of nodes' centrality changes over time. As examples, we apply our method to three empirical temporal networks: the United States Ph.D. exchange in mathematics, costarring relationships among top-billed actors during the Golden Age of Hollywood, and citations of decisions from the United States Supreme Court.
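The supra-centrality construction can be sketched in Python under assumptions that the abstract does not state: each layer's centrality matrix is taken to be its adjacency matrix, consecutive time layers are coupled by identity links between copies of the same node, and a parameter `eps` weights the within-layer terms relative to the coupling. The function name is hypothetical, and power iteration (with an identity shift for convergence) replaces a dense eigen-solver so that the example needs no external libraries.

```python
def joint_centrality(layers, eps, n_iter=500):
    """Dominant eigenvector of an NT x NT supra-centrality matrix.

    layers[t][i][j] is the centrality (here: adjacency) matrix of layer t.
    Returns x with x[t][i] = joint centrality of node i at time t,
    normalized so that all entries sum to 1.
    """
    n_layers, n_nodes = len(layers), len(layers[0])
    x = [[1.0] * n_nodes for _ in range(n_layers)]
    for _ in range(n_iter):
        y = []
        for t in range(n_layers):
            row = []
            for i in range(n_nodes):
                val = x[t][i]  # identity shift: same eigenvectors, safer convergence
                val += eps * sum(layers[t][i][j] * x[t][j] for j in range(n_nodes))
                if t > 0:
                    val += x[t - 1][i]  # coupling to the previous time layer
                if t < n_layers - 1:
                    val += x[t + 1][i]  # coupling to the next time layer
                row.append(val)
            y.append(row)
        norm = sum(sum(r) for r in y)
        x = [[v / norm for v in r] for r in y]
    return x


# Three nodes, two time layers: node 0 is the hub first, node 2 is the hub later.
layer_1 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
layer_2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
joint = joint_centrality([layer_1, layer_2], eps=1.0)

# Marginal centralities: sum the joint centrality over time or over nodes.
marginal_node = [sum(joint[t][i] for t in range(2)) for i in range(3)]
marginal_layer = [sum(joint[t]) for t in range(2)]
```

Varying `eps` changes how strongly the coupling smooths each node's centrality across layers, which is the trade-off between layer coupling and time scale that the abstract describes.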
40
Aikat J, Carsey TM, Fecho K, Jeffay K, Krishnamurthy A, Mucha PJ, Rajasekar A, Ahalt SC. Scientific Training in the Era of Big Data: A New Pedagogy for Graduate Education. Big Data 2017; 5:12-18. [PMID: 28287837 DOI: 10.1089/big.2016.0014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/06/2023]
Abstract
The era of "big data" has radically altered the way scientific research is conducted and new knowledge is discovered. Indeed, the scientific method is rapidly being complemented and even replaced in some fields by data-driven approaches to knowledge discovery. This paradigm shift is sometimes referred to as the "fourth paradigm" of data-intensive and data-enabled scientific discovery. Interdisciplinary research with a hard emphasis on translational outcomes is becoming the norm in all large-scale scientific endeavors. Yet, graduate education remains largely focused on individual achievement within a single scientific domain, with little training in team-based, interdisciplinary data-oriented approaches designed to translate scientific data into new solutions to today's critical challenges. In this article, we propose a new pedagogy for graduate education: data-centered learning for the domain-data scientist. Our approach is based on four tenets: (1) Graduate training must incorporate interdisciplinary training that couples the domain sciences with data science. (2) Graduate training must prepare students for work in data-enabled research teams. (3) Graduate training must include education in teaming and leadership skills for the data scientist. (4) Graduate training must provide experiential training through academic/industry practicums and internships. We emphasize that this approach is distinct from today's graduate training, which offers training in either data science or a domain science (e.g., biology, sociology, political science, economics, and medicine), but does not integrate the two within a single curriculum designed to prepare the next generation of domain-data scientists. We are in the process of implementing the proposed pedagogy through the development of a new graduate curriculum based on the above four tenets, and we describe herein our strategy, progress, and lessons learned. 
While our pedagogy was developed in the context of graduate education, the general approach of data-centered learning can and should be applied to students and professionals at any stage of their education, including at the K-12, undergraduate, graduate, and professional levels. We believe that the time is right to embed data-centered learning within our educational system and, thus, generate the talent required to fully harness the potential of big data.
41
Malik N, Shi F, Lee HW, Mucha PJ. Transitivity reinforcement in the coevolving voter model. Chaos (Woodbury, N.Y.) 2016; 26:123112. [PMID: 28039984 PMCID: PMC5848690 DOI: 10.1063/1.4972116] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/16/2016] [Accepted: 11/29/2016] [Indexed: 06/06/2023]
Abstract
One of the fundamental structural properties of many networks is triangle closure. Whereas the influence of this transitivity on a variety of contagion dynamics has been previously explored, existing models of coevolving or adaptive network systems typically use rewiring rules that randomize away this important property, raising questions about their applicability. In contrast, we study here a modified coevolving voter model dynamics that explicitly reinforces and maintains such clustering. Carrying out numerical simulations for a variety of parameter settings, we establish that the transitions and dynamical states observed in coevolving voter model networks without clustering are altered by reinforcing transitivity in the model. We then use a semi-analytical framework in terms of approximate master equations to predict the dynamical behaviors of the model for a variety of parameter settings.
42
Entwisle B, Williams NE, Verdery AM, Rindfuss RR, Walsh SJ, Malanson GP, Mucha PJ, Frizzelle BG, McDaniel PM, Yao X, Heumann BW, Prasartkul P, Sawangdee Y, Jampaklay A. Climate Shocks and Migration: An Agent-Based Modeling Approach. Population and Environment 2016; 38:47-71. [PMID: 27594725 PMCID: PMC5004973 DOI: 10.1007/s11111-016-0254-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 05/22/2023]
Abstract
This is a study of migration responses to climate shocks. We construct an agent-based model that incorporates dynamic linkages between demographic behaviors, such as migration, marriage, and births, and agriculture and land use, which depend on rainfall patterns. The rules and parameterization of our model are empirically derived from qualitative and quantitative analyses of a well-studied demographic field site, Nang Rong district, Northeast Thailand. With this model, we simulate patterns of migration under four weather regimes in a rice economy: 1) a reference, 'normal' scenario; 2) seven years of unusually wet weather; 3) seven years of unusually dry weather; and 4) seven years of extremely variable weather. Results show relatively small impacts on migration. Experiments with the model show that existing high migration rates and strong selection factors, which are unaffected by climate change, are likely responsible for the weak migration response.
|
43
|
Taylor D, Shai S, Stanley N, Mucha PJ. Enhanced Detectability of Community Structure in Multilayer Networks through Layer Aggregation. PHYSICAL REVIEW LETTERS 2016; 116:228301. [PMID: 27314740 PMCID: PMC5125641 DOI: 10.1103/physrevlett.116.228301] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 11/16/2015] [Indexed: 05/24/2023]
Abstract
Many systems are naturally represented by a multilayer network in which edges exist in multiple layers that encode different, but potentially related, types of interactions, and it is important to understand limitations on the detectability of community structure in these networks. Using random matrix theory, we analyze detectability limitations for multilayer (specifically, multiplex) stochastic block models (SBMs) in which L layers are derived from a common SBM. We study the effect of layer aggregation on detectability for several aggregation methods, including summation of the layers' adjacency matrices, for which we show the detectability limit vanishes as O(L^{-1/2}) with increasing number of layers, L. Importantly, we find a similar scaling behavior when the summation is thresholded at an optimal value, providing insight into the common (but not well understood) practice of thresholding pairwise-interaction data to obtain sparse network representations.
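The two aggregation methods named in this abstract, summation of layer adjacency matrices and thresholding of that sum, can be illustrated with a minimal sketch. This is not the authors' code: the function names `sample_sbm_layer` and `aggregate_layers`, the two-community setup, the edge probabilities, and the threshold value 3 are all assumptions chosen for illustration, and the threshold is arbitrary rather than the optimal value the abstract refers to.

```python
import numpy as np

def sample_sbm_layer(labels, p_in, p_out, rng):
    """Sample one undirected, unweighted SBM layer as a symmetric 0/1 matrix.

    Hypothetical helper: p_in and p_out are illustrative within- and
    between-community edge probabilities.
    """
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Draw each node pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)

def aggregate_layers(layers, threshold=None):
    """Sum layer adjacency matrices; optionally threshold the sum to 0/1."""
    total = np.sum(layers, axis=0)
    if threshold is None:
        return total
    return (total >= threshold).astype(int)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)          # two planted communities of 50 nodes
layers = [sample_sbm_layer(labels, 0.12, 0.08, rng) for _ in range(16)]

summed = aggregate_layers(layers)                    # weighted aggregate
thresholded = aggregate_layers(layers, threshold=3)  # sparse 0/1 aggregate
```

Here every layer is drawn from the same SBM, matching the multiplex setting described in the abstract; a community-detection method could then be run on `summed` or `thresholded` in place of any single layer.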
|
44
|
Stanley N, Shai S, Taylor D, Mucha PJ. Clustering network layers with the strata multilayer stochastic block model. IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING 2016; 3:95-105. [PMID: 28435844 PMCID: PMC5400296 DOI: 10.1109/tnse.2016.2537545] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 05/09/2023]
Abstract
Multilayer networks are a useful data structure for simultaneously capturing multiple types of relationships between a set of nodes. In such networks, each relational definition gives rise to a layer. While each layer provides its own set of information, community structure across layers can be collectively utilized to discover and quantify underlying relational patterns between nodes. To concisely extract information from a multilayer network, we propose to identify and combine sets of layers with meaningful similarities in community structure. In this paper, we describe the "strata multilayer stochastic block model" (sMLSBM), a probabilistic model for multilayer community structure. The central extension of the model is that there exist groups of layers, called "strata", which are defined such that all layers in a given stratum have community structure described by a common stochastic block model (SBM). That is, layers in a stratum exhibit similar node-to-community assignments and SBM probability parameters. Fitting the sMLSBM to a multilayer network provides a joint clustering that yields node-to-community and layer-to-stratum assignments, which cooperatively aid one another during inference. We describe an algorithm for separating layers into their appropriate strata and an inference technique for estimating the SBM parameters for each stratum. We demonstrate our method using synthetic networks and a multilayer network inferred from data collected in the Human Microbiome Project.
|
45
|
Malik N, Bookhagen B, Mucha PJ. Spatiotemporal patterns and trends of Indian monsoonal rainfall extremes. GEOPHYSICAL RESEARCH LETTERS 2016; 43:1710-1717. [PMID: 27909349 PMCID: PMC5125774 DOI: 10.1002/2016gl067841] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 05/26/2023]
Abstract
In this study, we provide a comprehensive analysis of trends in the extremes during the Indian summer monsoon (ISM) months (June to September) at different temporal and spatial scales. Our goal is to identify and quantify spatiotemporal patterns and trends that have emerged during the recent decades and may be associated with changing climatic conditions. Our analysis primarily relies on quantile regression that avoids making any subjective choices on spatial, temporal, or intensity pattern of extreme rainfall events. Our analysis divides the Indian monsoon region into climatic compartments that show different and partly opposing trends. These include strong trends towards intensified droughts in Northwest India, parts of Peninsular India, and Myanmar; in contrast, parts of Pakistan, Northwest Himalaya, and Central India show increased extreme daily rain intensity leading to higher flood vulnerability. Our analysis helps explain previously contradicting results of trends in average ISM rainfall.
|
46
|
Verdery AM, Mouw T, Bauldry S, Mucha PJ. Correction: Network Structure and Biased Variance Estimation in Respondent Driven Sampling. PLoS One 2016; 11:e0148006. [PMID: 26799651 PMCID: PMC4723124 DOI: 10.1371/journal.pone.0148006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
|
47
|
Verdery AM, Mouw T, Bauldry S, Mucha PJ. Network Structure and Biased Variance Estimation in Respondent Driven Sampling. PLoS One 2015; 10:e0145296. [PMID: 26679927 PMCID: PMC4682989 DOI: 10.1371/journal.pone.0145296] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 09/16/2013] [Accepted: 12/02/2015] [Indexed: 11/19/2022]
Abstract
This paper explores bias in the estimation of sampling variance in Respondent Driven Sampling (RDS). Prior methodological work on RDS has focused on its problematic assumptions and the biases and inefficiencies of its estimators of the population mean. Nonetheless, researchers have given only slight attention to the topic of estimating sampling variance in RDS, despite the importance of variance estimation for the construction of confidence intervals and hypothesis tests. In this paper, we show that the estimators of RDS sampling variance rely on a critical assumption that the network is First Order Markov (FOM) with respect to the dependent variable of interest. We demonstrate, through intuitive examples, mathematical generalizations, and computational experiments, that current RDS variance estimators will always underestimate the population sampling variance of RDS in empirical networks that do not conform to the FOM assumption. Analysis of 215 observed university and school networks from Facebook and Add Health indicates that the FOM assumption is violated in every empirical network we analyze, and that these violations lead to substantially biased RDS estimators of sampling variance. We propose and test two alternative variance estimators that show some promise for reducing biases, but which also illustrate the limits of estimating sampling variance with only partial information on the underlying population social network.
|
48
|
Parker KS, Wilson JD, Marschall J, Mucha PJ, Henderson JP. Network Analysis Reveals Sex- and Antibiotic Resistance-Associated Antivirulence Targets in Clinical Uropathogens. ACS Infect Dis 2015; 1:523-532. [PMID: 26985454 PMCID: PMC4788272 DOI: 10.1021/acsinfecdis.5b00022] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 02/26/2015] [Indexed: 01/29/2023]
Abstract
Increasing antibiotic resistance among uropathogenic Escherichia coli (UPEC) is driving interest in therapeutic targeting of nonconserved virulence factor (VF) genes. The ability to formulate efficacious combinations of antivirulence agents requires an improved understanding of how UPEC deploy these genes. To identify clinically relevant VF combinations, we applied contemporary network analysis and biclustering algorithms to VF profiles from a large, previously characterized inpatient clinical cohort. These mathematical approaches identified four stereotypical VF combinations with distinctive relationships to antibiotic resistance and patient sex that are independent of traditional phylogenetic grouping. Targeting resistance- or sex-associated VFs based upon these contemporary mathematical approaches may facilitate individualized anti-infective therapies and identify synergistic VF combinations in bacterial pathogens.
|
49
|
Jeub LGS, Balachandran P, Porter MA, Mucha PJ, Mahoney MW. Think locally, act locally: detection of small, medium-sized, and large communities in large networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2015; 91:012821. [PMID: 25679670 PMCID: PMC5125638 DOI: 10.1103/physreve.91.012821] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 03/15/2014] [Indexed: 06/04/2023]
Abstract
It is common in the study of networks to investigate intermediate-sized (or "meso-scale") features to try to gain an understanding of network structure and function. For example, numerous algorithms have been developed to try to identify "communities," which are typically construed as sets of nodes with denser connections internally than with the remainder of a network. In this paper, we adopt a complementary perspective that communities are associated with bottlenecks of locally biased dynamical processes that begin at seed sets of nodes, and we employ several different community-identification procedures (using diffusion-based and geodesic-based dynamics) to investigate community quality as a function of community size. Using several empirical and synthetic networks, we identify several distinct scenarios for "size-resolved community structure" that can arise in real (and realistic) networks: (1) the best small groups of nodes can be better than the best large groups (for a given formulation of the idea of a good community); (2) the best small groups can have a quality that is comparable to the best medium-sized and large groups; and (3) the best small groups of nodes can be worse than the best large groups. As we discuss in detail, which of these three cases holds for a given network can make an enormous difference when investigating and making claims about network community structure, and it is important to take this into account to obtain reliable downstream conclusions. Depending on which scenario holds, one may or may not be able to successfully identify "good" communities in a given network (and good communities might not even exist for a given community quality measure), the manner in which different small communities fit together to form meso-scale network structures can be very different, and processes such as viral propagation and information diffusion can exhibit very different dynamics. 
In addition, our results suggest that, for many large realistic networks, the output of locally biased methods that focus on communities that are centered around a given seed node (or set of seed nodes) might have better conceptual grounding and greater practical utility than the output of global community-detection methods. They also illustrate structural properties that are important to consider in the development of better benchmark networks to test methods for community detection.
|
50
|
Wilson JD, Wang S, Mucha PJ, Bhamidi S, Nobel AB. A testing based extraction algorithm for identifying significant communities in networks. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas760] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/19/2022]
|