1
|
Sormunen S, Leskelä L, Saramäki J. Distinguishing subsampled power laws from other heavy-tailed distributions. Phys Rev E 2024; 109:054308. [PMID: 38907423 DOI: 10.1103/physreve.109.054308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Accepted: 04/11/2024] [Indexed: 06/24/2024]
Abstract
Distinguishing power-law distributions from other heavy-tailed distributions is challenging, and this task is often further complicated by subsampling effects. In this work, we evaluate the performance of two commonly used methods for detecting power-law distributions-the maximum likelihood method of Clauset et al. and the extreme value method of Voitalov et al.-in distinguishing subsampled power laws from two other heavy-tailed distributions, the lognormal and the stretched exponential distributions. We focus on a random subsampling method commonly applied in network science and biological sciences. In this subsampling scheme, we are ultimately interested in the frequency distribution of elements with a certain number of constituent parts-for example, species with k individuals or nodes with k connections-and each part is selected to the subsample with an equal probability. We investigate how well the results obtained from low-subsampling-depth subsamples generalize to the original distribution. Our results show that the power-law exponent of the original distribution can be estimated fairly accurately from subsamples, but classifying the distribution correctly is more challenging. The maximum likelihood method falsely rejects the power-law hypothesis for a large fraction of subsamples from power-law distributions. While the extreme value method correctly recognizes subsampled power-law distributions with all tested subsampling depths, its capacity to distinguish power laws from the heavy-tailed alternatives is limited. However, these false positives tend to result not from the subsampling itself but from the estimators' inability to classify the original sample correctly. In fact, we show that the extreme value method can sometimes be expected to perform better on subsamples than on the original samples from the lognormal and the stretched exponential distributions, while the contrary is true for the main tests included in the maximum likelihood method.
Collapse
Affiliation(s)
- Silja Sormunen
- Department of Computer Science, Aalto University, 00076 Espoo, Finland
| | - Lasse Leskelä
- Department of Mathematics and Systems Analysis, Aalto University, 00076 Espoo, Finland
| | - Jari Saramäki
- Department of Computer Science, Aalto University, 00076 Espoo, Finland
| |
Collapse
|
2
|
Lee MJ, Kim JH, Goh KI, Lee SH, Son SW, Lee DS. Degree distributions under general node removal: Power-law or Poisson? Phys Rev E 2022; 106:064309. [PMID: 36671153 DOI: 10.1103/physreve.106.064309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2022] [Accepted: 11/01/2022] [Indexed: 06/17/2023]
Abstract
Perturbations made to networked systems may result in partial structural loss, such as a blackout in a power-grid system. Investigating the resulting disturbance in network properties is quintessential to understand real networks in action. The removal of nodes is a representative disturbance, but previous studies are seemingly contrasting about its effect on arguably the most fundamental network statistic, the degree distribution. The key question is about the functional form of the degree distributions that can be altered during node removal or sampling. The functional form is decisive in the remaining subnetwork's static and dynamical properties. In this work, we clarify the situation by utilizing the relative entropies with respect to the reference distributions in the Poisson and power-law form, to quantify the distance between the subnetwork's degree distribution and either of the reference distributions. Introducing general sequential node removal processes with continuously different levels of hub protection to encompass a series of scenarios including uniform random removal and preferred or protective (i.e., biased random) removal of the hub, we classify the altered degree distributions starting from various power-law forms by comparing two relative entropy values. From the extensive investigation in various scenarios based on direct node-removal simulations and by solving the rate equation of degree distributions, we discover in the parameter space two distinct regimes, one where the degree distribution is closer to the power-law reference distribution and the other closer to the Poisson distribution.
Collapse
Affiliation(s)
- Mi Jin Lee
- Department of Applied Physics, Hanyang University, Ansan 15588, Korea
| | - Jung-Ho Kim
- Department of Physics, Korea University, Seoul 02841, Korea
| | - Kwang-Il Goh
- Department of Physics, Korea University, Seoul 02841, Korea
| | - Sang Hoon Lee
- Department of Physics and Research Institute of Natural Science, Gyeongsang National University, Jinju 52828, Korea
- Future Convergence Technology Research Institute, Gyeongsang National University, Jinju 52849, Korea
| | - Seung-Woo Son
- Department of Applied Physics, Hanyang University, Ansan 15588, Korea
| | - Deok-Sun Lee
- School of Computational Sciences and Center for AI and Natural Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea
| |
Collapse
|
3
|
Akbar S, Saritha SK. Quantum inspired community detection for analysis of biodiversity change driven by land-use conversion and climate change. Sci Rep 2021; 11:14332. [PMID: 34253748 PMCID: PMC8275618 DOI: 10.1038/s41598-021-93122-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Accepted: 06/21/2021] [Indexed: 02/06/2023] Open
Abstract
Community detection remains little explored in the analysis of biodiversity change. The challenges linked with global biodiversity change have also multiplied manifold in the past few decades. Moreover, most studies concerning biodiversity change lack the quantitative treatment central to species distribution modeling. Empirical analysis of species distribution and abundance is thus integral to the study of biodiversity loss and biodiversity alterations. Community detection is therefore expected to efficiently model the topological aspect of biodiversity change driven by land-use conversion and climate change; given that it has already proven superior for diverse problems in the domain of social network analysis and subgroup discovery in complex systems. Thus, quantum inspired community detection is proposed as a novel technique to predict biodiversity change considering tiger population in eighteen states of India; leading to benchmarking of two novel datasets. Elements of land-use conversion and climate change are explored to design these datasets viz.-Landscape based distribution and Number of tiger reserves based distribution respectively; for predicting regions expected to maximize Tiger population growth. Furthermore, validation of the proposed framework on the said datasets is performed using standard community detection metrics like-Modularity, Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Degree distribution, Degree centrality and Edge-betweenness centrality. Quantum inspired community detection has also been successful in demonstrating an association between biodiversity change, land-use conversion and climate change; validated statistically by Pearson's correlation coefficient and p value test. Finally, modularity distribution based on parameter tuning establishes the superiority of the second dataset based on the number of Tiger reserves-in predicting regions maximizing Tiger population growth fostering species distribution and abundance; apart from scripting a stronger correlation of biodiversity change with land-use conversion.
Collapse
Affiliation(s)
- Sana Akbar
- Department of CSE, MANIT, Bhopal, India.
| | | |
Collapse
|
4
|
Antunes N, Bhamidi S, Guo T, Pipiras V, Wang B. Sampling Based Estimation of In-Degree Distribution for Directed Complex Networks. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1873143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Affiliation(s)
- Nelson Antunes
- Center for Computational and Stochastic Mathematics, University of Lisbon and University of Algarve, Lisbon, Portugal
| | - Shankar Bhamidi
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC
| | - Tianjian Guo
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC
| | - Vladas Pipiras
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC
| | - Bang Wang
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA
| |
Collapse
|
5
|
Zhao Y, Jiang H, Chen Q, Qin Y, Xie H, Wu Y, Liu S, Zhou Z, Xia J, Zhou F. Preserving Minority Structures in Graph Sampling. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:1698-1708. [PMID: 33048731 DOI: 10.1109/tvcg.2020.3030428] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Sampling is a widely used graph reduction technique to accelerate graph computations and simplify graph visualizations. By comprehensively analyzing the literature on graph sampling, we assume that existing algorithms cannot effectively preserve minority structures that are rare and small in a graph but are very important in graph analysis. In this work, we initially conduct a pilot user study to investigate representative minority structures that are most appealing to human viewers. We then perform an experimental study to evaluate the performance of existing graph sampling algorithms regarding minority structure preservation. Results confirm our assumption and suggest key points for designing a new graph sampling approach named mino-centric graph sampling (MCGS). In this approach, a triangle-based algorithm and a cut-point-based algorithm are proposed to efficiently identify minority structures. A set of importance assessment criteria are designed to guide the preservation of important minority structures. Three optimization objectives are introduced into a greedy strategy to balance the preservation between minority and majority structures and suppress the generation of new minority structures. A series of experiments and case studies are conducted to evaluate the effectiveness of the proposed MCGS.
Collapse
|
6
|
Izaguirre A, Montes-Rojas G. Horvitz-Thompson estimator under partial information with an application to network degree distribution. COMMUN STAT-SIMUL C 2021. [DOI: 10.1080/03610918.2018.1554117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Affiliation(s)
| | - Gabriel Montes-Rojas
- Universidad De Buenos Aires-Facultad de Ciencias Economicas, Buenos Aires, Argentina
| |
Collapse
|
7
|
de Aguiar MAM, Newman EA, Pires MM, Yeakel JD, Boettiger C, Burkle LA, Gravel D, Guimarães PR, O'Donnell JL, Poisot T, Fortin MJ, Hembry DH. Revealing biases in the sampling of ecological interaction networks. PeerJ 2019; 7:e7566. [PMID: 31534845 PMCID: PMC6727833 DOI: 10.7717/peerj.7566] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2019] [Accepted: 07/29/2019] [Indexed: 11/20/2022] Open
Abstract
The structure of ecological interactions is commonly understood through analyses of interaction networks. However, these analyses may be sensitive to sampling biases with respect to both the interactors (the nodes of the network) and interactions (the links between nodes), because the detectability of species and their interactions is highly heterogeneous. These ecological and statistical issues directly affect ecologists’ abilities to accurately construct ecological networks. However, statistical biases introduced by sampling are difficult to quantify in the absence of full knowledge of the underlying ecological network’s structure. To explore properties of large-scale ecological networks, we developed the software EcoNetGen, which constructs and samples networks with predetermined topologies. These networks may represent a wide variety of communities that vary in size and types of ecological interactions. We sampled these networks with different mathematical sampling designs that correspond to methods used in field observations. The observed networks generated by each sampling process were then analyzed with respect to the number of components, size of components and other network metrics. We show that the sampling effort needed to estimate underlying network properties depends strongly both on the sampling design and on the underlying network topology. In particular, networks with random or scale-free modules require more complete sampling to reveal their structure, compared to networks whose modules are nested or bipartite. Overall, modules with nested structure were the easiest to detect, regardless of the sampling design used. Sampling a network starting with any species that had a high degree (e.g., abundant generalist species) was consistently found to be the most accurate strategy to estimate network structure. Because high-degree species tend to be generalists, abundant in natural communities relative to specialists, and connected to each other, sampling by degree may therefore be common but unintentional in empirical sampling of networks. Conversely, sampling according to module (representing different interaction types or taxa) results in a rather complete view of certain modules, but fails to provide a complete picture of the underlying network. To reduce biases introduced by sampling methods, we recommend that these findings be incorporated into field design considerations for projects aiming to characterize large species interaction networks.
Collapse
Affiliation(s)
- Marcus A M de Aguiar
- Instituto de Física "Gleb Wataghin", Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
| | - Erica A Newman
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Mathias M Pires
- Departamento de Biologia Animal, Instituto de Biologia, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
| | - Justin D Yeakel
- School of Natural Sciences, University of California, Merced, CA, USA.,Santa Fe Institute, Santa Fe, NM, USA
| | - Carl Boettiger
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA, USA
| | - Laura A Burkle
- Department of Ecology, Montana State University, Bozeman, MT, USA
| | - Dominique Gravel
- Département de Biologie, Université de Sherbrooke, Sherbrooke, QC, Canada
| | - Paulo R Guimarães
- Departamento de Ecologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
| | - James L O'Donnell
- School of Marine and Environmental Affairs, University of Washington, Seattle, WA, USA
| | - Timothée Poisot
- Département de Sciences Biologiques, Université de Montréal, Montréal, QC, Canada.,Québec Centre for Biodiversity Sciences, Montréal, QC, Canada
| | - Marie-Josée Fortin
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, ON, Canada
| | - David H Hembry
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA.,Department of Entomology, Cornell University, Ithaca, NY, USA
| |
Collapse
|
8
|
Murase Y, Jo HH, Török J, Kertész J, Kaski K. Sampling networks by nodal attributes. Phys Rev E 2019; 99:052304. [PMID: 31212524 DOI: 10.1103/physreve.99.052304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2019] [Indexed: 06/09/2023]
Abstract
In a social network individuals or nodes connect to other nodes by choosing one of the channels of communication at a time to re-establish the existing social links. Since available data sets are usually restricted to a limited number of channels or layers, these autonomous decision making processes by the nodes constitute the sampling of a multiplex network leading to just one (though very important) example of sampling bias caused by the behavior of the nodes. We develop a general setting to get insight and understand the class of network sampling models, where the probability of sampling a link in the original network depends on the attributes h of its adjacent nodes. Assuming that the nodal attributes are independently drawn from an arbitrary distribution ρ(h) and that the sampling probability r(h_{i},h_{j}) for a link ij of nodal attributes h_{i} and h_{j} is also arbitrary, we derive exact analytic expressions of the sampled network for such network characteristics as the degree distribution, degree correlation, and clustering spectrum. The properties of the sampled network turn out to be sums of quantities for the original network topology weighted by the factors stemming from the sampling. Based on our analysis, we find that the sampled network may have sampling-induced network properties that are absent in the original network, which implies the potential risk of a naive generalization of the results of the sample to the entire original network. We also consider the case, when neighboring nodes have correlated attributes to show how to generalize our formalism for such sampling bias and we get good agreement between the analytic results and the numerical simulations.
Collapse
Affiliation(s)
- Yohsuke Murase
- RIKEN Center for Computational Science, 7-1-26, Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo, 650-0047, Japan
| | - Hang-Hyun Jo
- Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea
- Department of Physics, Pohang University of Science and Technology, Pohang 37673, Republic of Korea
- Department of Computer Science, Aalto University, Espoo FI-00076, Finland
| | - János Török
- Department of Theoretical Physics, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
- Department of Network and Data Science, Central European University, Nádor u. 9, H-1051 Budapest, Hungary
- MTA-BME Morphodynamics Research Group, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
| | - János Kertész
- Department of Computer Science, Aalto University, Espoo FI-00076, Finland
- Department of Network and Data Science, Central European University, Nádor u. 9, H-1051 Budapest, Hungary
| | - Kimmo Kaski
- Department of Computer Science, Aalto University, Espoo FI-00076, Finland
- The Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB, United Kingdom
| |
Collapse
|
9
|
Gerlach M, Altmann EG. Testing Statistical Laws in Complex Systems. PHYSICAL REVIEW LETTERS 2019; 122:168301. [PMID: 31075025 DOI: 10.1103/physrevlett.122.168301] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Revised: 12/19/2018] [Indexed: 06/09/2023]
Abstract
The availability of large datasets requires an improved view on statistical laws in complex systems, such as Zipf's law of word frequencies, the Gutenberg-Richter law of earthquake magnitudes, or scale-free degree distribution in networks. In this Letter, we discuss how the statistical analysis of these laws are affected by correlations present in the observations, the typical scenario for data from complex systems. We first show how standard maximum-likelihood recipes lead to false rejections of statistical laws in the presence of correlations. We then propose a conservative method (based on shuffling and undersampling the data) to test statistical laws and find that accounting for correlations leads to smaller rejection rates and larger confidence intervals on estimated parameters.
Collapse
Affiliation(s)
- Martin Gerlach
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208, USA
| | - Eduardo G Altmann
- School of Mathematics and Statistics, University of Sydney, 2006 NSW, Australia
| |
Collapse
|
10
|
Liu X, Pan L, Stanley HE, Gao J. Controllability of giant connected components in a directed network. Phys Rev E 2017; 95:042318. [PMID: 28505769 DOI: 10.1103/physreve.95.042318] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2016] [Indexed: 02/02/2023]
Abstract
When controlling a complex networked system it is not feasible to control the full network because many networks, including biological, technological, and social systems, are massive in size and complexity. But neither is it necessary to control the full network. In complex networks, the giant connected components provide the essential information about the entire system. How to control these giant connected components of a network remains an open question. We derive the mathematical expression of the degree distributions for four types of giant connected components and develop an analytic tool for studying the controllability of these giant connected components. We find that for both Erdős-Rényi (ER) networks and scale-free (SF) networks with p fraction of remaining nodes, the minimum driver node density to control the giant component first increases and then decreases as p increases from zero to one, showing a peak at a critical point p=p_{m}. We find that, for ER networks, the peak value of the driver node density remains the same regardless of its average degree 〈k〉 and that it is determined by p_{m}〈k〉. In addition, we find that for SF networks the minimum driver node densities needed to control the giant components of networks decrease as the degree distribution exponents increase. Comparing the controllability of the giant components of ER networks and SF networks, we find that when the fraction of remaining nodes p is low, the giant in-connected, out-connected, and strong-connected components in ER networks have lower controllability than those in SF networks.
Collapse
Affiliation(s)
- Xueming Liu
- Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.,Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
| | - Linqiang Pan
- Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.,School of Electric and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, Henan, China
| | - H Eugene Stanley
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
| | - Jianxi Gao
- Center for Complex Network Research and Department of Physics, Northeastern University, Boston, Massachusetts 02115, USA
| |
Collapse
|
11
|
Török J, Murase Y, Jo HH, Kertész J, Kaski K. What Big Data tells: Sampling the social network by communication channels. Phys Rev E 2016; 94:052319. [PMID: 27967040 DOI: 10.1103/physreve.94.052319] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2015] [Indexed: 11/07/2022]
Abstract
Big Data has become the primary source of understanding the structure and dynamics of the society at large scale. The network of social interactions can be considered as a multiplex, where each layer corresponds to one communication channel and the aggregate of all of them constitutes the entire social network. However, usually one has information only about one of the channels or even a part of it, which should be considered as a subset or sample of the whole. Here we introduce a model based on a natural bilateral communication channel selection mechanism, which for one channel leads to consistent changes in the network properties. For example, while it is expected that the degree distribution of the whole social network has a maximum at a value larger than one, we get a monotonically decreasing distribution as observed in empirical studies of single-channel data. We also find that assortativity may occur or get strengthened due to the sampling method. We analyze the far-reaching consequences of our findings.
Collapse
Affiliation(s)
- János Török
- Department of Theoretical Physics, Budapest University of Technology and Economics, Budapest H-1111, Hungary.,Center for Network Science, Central European University, Budapest H-1051, Hungary
| | - Yohsuke Murase
- RIKEN Advanced Institute for Computational Science, Kobe, Hyogo 650-0047, Japan
| | - Hang-Hyun Jo
- Department of Physics, Pohang University of Science and Technology, Pohang 37673, Republic of Korea.,Department of Computer Science, Aalto University School of Science, P.O. Box 15500, Espoo, Finland
| | - János Kertész
- Department of Theoretical Physics, Budapest University of Technology and Economics, Budapest H-1111, Hungary.,Center for Network Science, Central European University, Budapest H-1051, Hungary.,Department of Computer Science, Aalto University School of Science, P.O. Box 15500, Espoo, Finland
| | - Kimmo Kaski
- Department of Computer Science, Aalto University School of Science, P.O. Box 15500, Espoo, Finland
| |
Collapse
|
12
|
Annibale A, Coolen ACC, Planell-Morell N. Quantifying noise in mass spectrometry and yeast two-hybrid protein interaction detection experiments. J R Soc Interface 2015; 12:0573. [PMID: 26333811 DOI: 10.1098/rsif.2015.0573] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Protein interaction networks (PINs) are popular means to visualize the proteome. However, PIN datasets are known to be noisy, incomplete and biased by the experimental protocols used to detect protein interactions. This paper aims at understanding the connection between true protein interactions and the protein interaction datasets that have been obtained using the most popular experimental techniques, i.e. mass spectronomy and yeast two-hybrid. We start from the observation that the adjacency matrix of a PIN, i.e. the binary matrix which defines, for every pair of proteins in the network, whether or not there is a link, has a special form, that we call separable. This induces precise relationships between the moments of the degree distribution (i.e. the average number of links that a protein in the network has, its variance, etc.) and the number of short loops (i.e. triangles, squares, etc.) along the links of the network. These relationships provide powerful tools to test the reliability of datasets and hint at the underlying biological mechanism with which proteins and complexes recruit each other.
Collapse
Affiliation(s)
- A Annibale
- Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK
| | - A C C Coolen
- Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK Institute for Mathematical and Molecular Biomedicine, King's College London, Hodgkin Building, London SE1 1UL, UK London Institute for Mathematical Sciences, 22 South Audley Street, London W1K 2NY, UK
| | - N Planell-Morell
- Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK
| |
Collapse
|
13
|
Zhang Y, Kolaczyk ED, Spencer BD. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas800] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
14
|
Aliakbary S, Motallebi S, Rashidian S, Habibi J, Movaghar A. Distance metric learning for complex networks: towards size-independent comparison of network structures. CHAOS (WOODBURY, N.Y.) 2015; 25:023111. [PMID: 25725647 DOI: 10.1063/1.4908605] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Real networks show nontrivial topological properties such as community structure and long-tail degree distribution. Moreover, many network analysis applications are based on topological comparison of complex networks. Classification and clustering of networks, model selection, and anomaly detection are just some applications of network comparison. In these applications, an effective similarity metric is needed which, given two complex networks of possibly different sizes, evaluates the amount of similarity between the structural features of the two networks. Traditional graph comparison approaches, such as isomorphism-based methods, are not only too time consuming but also inappropriate to compare networks with different sizes. In this paper, we propose an intelligent method based on the genetic algorithms for integrating, selecting, and weighting the network features in order to develop an effective similarity measure for complex networks. The proposed similarity metric outperforms state of the art methods with respect to different evaluation criteria.
Collapse
Affiliation(s)
- Sadegh Aliakbary
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Sadegh Motallebi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Sina Rashidian
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Jafar Habibi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Ali Movaghar
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| |
Collapse
|
15
|
Bliss CA, Danforth CM, Dodds PS. Estimation of global network statistics from incomplete data. PLoS One 2014; 9:e108471. [PMID: 25338183 PMCID: PMC4206292 DOI: 10.1371/journal.pone.0108471] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Accepted: 08/24/2014] [Indexed: 11/30/2022] Open
Abstract
Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
Collapse
Affiliation(s)
- Catherine A. Bliss
- Department of Mathematics and Statistics, Vermont Complex Systems Center, The Computational Story Lab, and the Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
- * E-mail:
| | - Christopher M. Danforth
- Department of Mathematics and Statistics, Vermont Complex Systems Center, The Computational Story Lab, and the Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
| | - Peter Sheridan Dodds
- Department of Mathematics and Statistics, Vermont Complex Systems Center, The Computational Story Lab, and the Vermont Advanced Computing Core, University of Vermont, Burlington, Vermont, United States of America
| |
Collapse
|
16
|
|
17
|
Son SW, Christensen C, Bizhani G, Foster DV, Grassberger P, Paczuski M. Sampling properties of directed networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 86:046104. [PMID: 23214649 DOI: 10.1103/physreve.86.046104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/05/2012] [Indexed: 06/01/2023]
Abstract
For many real-world networks only a small "sampled" version of the original network may be investigated; those results are then used to draw conclusions about the actual system. Variants of breadth-first search (BFS) sampling, which are based on epidemic processes, are widely used. Although it is well established that BFS sampling fails, in most cases, to capture the IN component(s) of directed networks, a description of the effects of BFS sampling on other topological properties is all but absent from the literature. To systematically study the effects of sampling biases on directed networks, we compare BFS sampling to random sampling on complete large-scale directed networks. We present new results and a thorough analysis of the topological properties of seven complete directed networks (prior to sampling), including three versions of Wikipedia, three different sources of sampled World Wide Web data, and an Internet-based social network. We detail the differences that sampling method and coverage can make to the structural properties of sampled versions of these seven networks. Most notably, we find that sampling method and coverage affect both the bow-tie structure and the number and structure of strongly connected components in sampled networks. In addition, at a low sampling coverage (i.e., less than 40%), the values of average degree, variance of out-degree, degree autocorrelation, and link reciprocity are overestimated by 30% or more in BFS-sampled networks and only attain values within 10% of the corresponding values in the complete networks when sampling coverage is in excess of 65%. These results may cause us to rethink what we know about the structure, function, and evolution of real-world directed networks.
Collapse
Affiliation(s)
- S-W Son
- Complexity Science Group, University of Calgary, Calgary T2N 1N4, Canada.
| | | | | | | | | | | |
Collapse
|
18
|
Abstract
The ability to analyze large biological networks proves to be a computationally expensive task, but the information one can gain is worth the cost and effort. In cancer research for example, one is able to derive knowledge about putative drug targets by revealing the strengths and weaknesses inherent in a protein-protein interaction (PPI) network. Further, network analyses can be used to optimize high-throughput genetic and proteomic experiments. In addition, the study of biological networks is now an active part of molecular biology. In this chapter, we review techniques for studying biological networks in general but with a focus on PPI networks, including an example of a bacterial PPI network. After a brief introduction, we concentrate on methods based on the analysis of subnetworks, namely, graph motifs and graphlets.
Collapse
|
19
|
Djebbari A, Ali M, Otasek D, Kotlyar M, Fortney K, Wong S, Hrvojic A, Jurisica I. NAViGaTOR: Large Scalable and Interactive Navigation and Analysis of Large Graphs. ACTA ACUST UNITED AC 2011. [DOI: 10.1080/15427951.2011.604289] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
|
20
|
Annibale A, Coolen ACC. What you see is not what you get: how sampling affects macroscopic features of biological networks. Interface Focus 2011; 1:836-56. [PMID: 23226585 DOI: 10.1098/rsfs.2011.0050] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2011] [Accepted: 08/30/2011] [Indexed: 11/12/2022] Open
Abstract
We use mathematical methods from the theory of tailored random graphs to study systematically the effects of sampling on topological features of large biological signalling networks. Our aim in doing so is to increase our quantitative understanding of the relation between true biological networks and the imperfect and often biased samples of these networks that are reported in public data repositories and used by biomedical scientists. We derive exact explicit formulae for degree distributions and degree correlation kernels of sampled networks, in terms of the degree distributions and degree correlation kernels of the underlying true network, for a broad family of sampling protocols that include random and connectivity-dependent node and/or link undersampling as well as random and connectivity-dependent link oversampling. Our predictions are in excellent agreement with numerical simulations.
Collapse
Affiliation(s)
- A Annibale
- Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK
| | | |
Collapse
|
21
|
A timestepper-based approach for the coarse-grained analysis of microscopic neuronal simulators on networks: Bifurcation and rare-events micro- to macro-computations. Neurocomputing 2011. [DOI: 10.1016/j.neucom.2011.06.018] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
22
|
Stumpf MPH, Wiuf C. Incomplete and noisy network data as a percolation process. J R Soc Interface 2010; 7:1411-9. [PMID: 20378609 PMCID: PMC2935600 DOI: 10.1098/rsif.2010.0044] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2010] [Accepted: 03/18/2010] [Indexed: 11/12/2022] Open
Abstract
We discuss the ramifications of noisy and incomplete observations of network data on the existence of a giant connected component (GCC). The existence of a GCC in a random graph can be described in terms of a percolation process, and building on general results for classes of random graphs with specified degree distributions we derive percolation thresholds above which GCCs exist. We show that sampling and noise can have a profound effect on the perceived existence of a GCC and find that both processes can destroy it. We also show that the absence of a GCC puts a theoretical upper bound on the false-positive rate and relate our percolation analysis to experimental protein-protein interaction data.
Collapse
Affiliation(s)
- Michael P. H. Stumpf
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW72AZ, UK
- Institute of Mathematical Sciences, Imperial College London, London SW72AZ, UK
| | - Carsten Wiuf
- Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London SW72AZ, UK
- Bioinformatics Research Center, University of Aarhus, 8000 Aarhus C, Denmark
| |
Collapse
|
23
|
Lovison A, Manzini G, Maritan A, Putti M, Rinaldo A. Spanning traceroutes over modular networks and general scaling degree distributions. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2010; 81:036105. [PMID: 20365813 DOI: 10.1103/physreve.81.036105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/11/2009] [Revised: 01/05/2010] [Indexed: 05/29/2023]
Abstract
We analyze the class of networks characterized by modular structure where a sequence of l Erdös-Renyi random networks of size Nl with random average degrees is joined by links whose structure must remain immaterial. We find that traceroutes spanning the entire macronetwork exhibit scaling degree distributions P(k) approximately k-gamma, where gamma depends on how the degrees of the joined clusters are distributed. We thus suggest that yet another mechanism for the dynamic origin of arbitrary power-law degree distributions observed in natural and artificial networks, many of which belong to the range 2<or=gamma<or=3 , may be found in random processes on modular networks.
Collapse
Affiliation(s)
- Alberto Lovison
- Dipartimento di Matematica Pura e Applicata, Università di Padova, I-35121 Padova, Italy.
| | | | | | | | | |
Collapse
|
24
|
Sokolov IM, Eliazar II. Sampling from scale-free networks and the matchmaking paradox. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2010; 81:026107. [PMID: 20365631 DOI: 10.1103/physreve.81.026107] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2009] [Revised: 10/28/2009] [Indexed: 05/29/2023]
Abstract
Consider a large finite scale-free network consisting of M>>1 nodes and N>>1 links, in which the degree distribution of links per bond is governed by a power-law P(n) approximately n(-1-alpha) with exponent 0<alpha<1. A subset of m<<M nodes is sampled arbitrarily, yielding the empirical sample mean eta : the average number of links per node, within the sampled subset. We explore the statistics of the sample mean eta and show that its fluctuations around the network mean nu=N/M are extremely broad and strongly skewed--yielding typical values, which are systematically and significantly smaller than the network mean nu. Applying these results to the case of bipartite scale-free networks, we show that the sample means of the two parts of these networks generally differ--a fact we refer to as the "matchmaking paradox."
Collapse
Affiliation(s)
- I M Sokolov
- Institut für Physik, Humboldt-Universität zu Berlin, Newtonstr. 15, D-12489 Berlin, Germany
| | | |
Collapse
|
25
|
Bobashev G, Morris RJ, Goedecke DM. Sampling for global epidemic models and the topology of an international airport network. PLoS One 2008; 3:e3154. [PMID: 18776932 PMCID: PMC2522280 DOI: 10.1371/journal.pone.0003154] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2008] [Accepted: 08/07/2008] [Indexed: 11/18/2022] Open
Abstract
Mathematical models that describe the global spread of infectious diseases such as influenza, severe acute respiratory syndrome (SARS), and tuberculosis (TB) often consider a sample of international airports as a network supporting disease spread. However, there is no consensus on how many cities should be selected or on how to select those cities. Using airport flight data that commercial airlines reported to the Official Airline Guide (OAG) in 2000, we have examined the network characteristics of network samples obtained under different selection rules. In addition, we have examined different size samples based on largest flight volume and largest metropolitan populations. We have shown that although the bias in network characteristics increases with the reduction of the sample size, a relatively small number of areas that includes the largest airports, the largest cities, the most-connected cities, and the most central cities is enough to describe the dynamics of the global spread of influenza. The analysis suggests that a relatively small number of cities (around 200 or 300 out of almost 3000) can capture enough network information to adequately describe the global spread of a disease such as influenza. Weak traffic flows between small airports can contribute to noise and mask other means of spread such as the ground transportation.
Collapse
Affiliation(s)
- Georgiy Bobashev
- RTI International, Research Triangle Park, North Carolina, United States of America.
| | | | | |
Collapse
|
26
|
Yang L, Vondriska TM, Han Z, Maclellan WR, Weiss JN, Qu Z. Deducing topology of protein-protein interaction networks from experimentally measured sub-networks. BMC Bioinformatics 2008; 9:301. [PMID: 18598366 PMCID: PMC2474618 DOI: 10.1186/1471-2105-9-301] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2008] [Accepted: 07/03/2008] [Indexed: 01/17/2023] Open
Abstract
Background Protein-protein interaction networks are commonly sampled using yeast two hybrid approaches. However, whether topological information reaped from these experimentally-measured sub-networks can be extrapolated to complete protein-protein interaction networks is unclear. Results By analyzing various experimental protein-protein interaction datasets, we found that they are not random samples of the parent networks. Based on the experimental bait-prey behaviors, our computer simulations show that these non-random sampling features may affect the topological information. We tested the hypothesis that a core sub-network exists within the experimentally sampled network that better maintains the topological characteristics of the parent protein-protein interaction network. We developed a method to filter the experimentally sampled network to result in a core sub-network that more accurately reflects the topology of the parent network. These findings have fundamental implications for large-scale protein interaction studies and for our understanding of the behavior of cellular networks. Conclusion The topological information from experimental measured networks network as is may not be the correct source for topological information about the parent protein-protein interaction network. We define a core sub-network that more accurately reflects the topology of the parent network.
Collapse
Affiliation(s)
- Ling Yang
- Department of Medicine, David Geffen School of Medicine at University of California, Los Angeles, California 90095, USA .
| | | | | | | | | | | |
Collapse
|
27
|
Chiang T, Scholtens D, Sarkar D, Gentleman R, Huber W. Coverage and error models of protein-protein interaction data by directed graph analysis. Genome Biol 2008; 8:R186. [PMID: 17845715 PMCID: PMC2375024 DOI: 10.1186/gb-2007-8-9-r186] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2007] [Revised: 05/26/2007] [Accepted: 09/10/2007] [Indexed: 01/10/2023] Open
Abstract
Using a directed graph model for bait to prey systems and a multinomial error model, we assessed the error statistics in all published large-scale datasets for Saccharomyces cerevisiae and characterized them by three traits: the set of tested interactions, artifacts that lead to false-positive or false-negative observations, and estimates of the stochastic error rates that affect the data. These traits provide a prerequisite for the estimation of the protein interactome and its modules.
Collapse
Affiliation(s)
- Tony Chiang
- EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Fred Hutchinson Cancer Research Center, Computational Biology Group, Fairview Avenue North, Seattle, WA 98109-1024, USA
| | - Denise Scholtens
- Northwestern University, Department of Preventive Medicine, N Lake Shore Drive, Chicago, IL 60611-4402, USA
| | - Deepayan Sarkar
- Fred Hutchinson Cancer Research Center, Computational Biology Group, Fairview Avenue North, Seattle, WA 98109-1024, USA
| | - Robert Gentleman
- Fred Hutchinson Cancer Research Center, Computational Biology Group, Fairview Avenue North, Seattle, WA 98109-1024, USA
| | - Wolfgang Huber
- EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
Collapse
|
28
|
Abstract
After the completion of the human and other genome projects it emerged that the number of genes in organisms as diverse as fruit flies, nematodes, and humans does not reflect our perception of their relative complexity. Here, we provide reliable evidence that the size of protein interaction networks in different organisms appears to correlate much better with their apparent biological complexity. We develop a stable and powerful, yet simple, statistical procedure to estimate the size of the whole network from subnet data. This approach is then applied to a range of eukaryotic organisms for which extensive protein interaction data have been collected and we estimate the number of interactions in humans to be approximately 650,000. We find that the human interaction network is one order of magnitude bigger than the Drosophila melanogaster interactome and approximately 3 times bigger than in Caenorhabditis elegans.
Collapse
|
29
|
Abstract
We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.
Collapse
|
30
|
Jensen HJ. Emergence of network structure in models of collective evolution and evolutionary dynamics. Proc Math Phys Eng Sci 2008. [DOI: 10.1098/rspa.2008.0032] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We consider an evolving network of a fixed number of nodes. The allocation of edges is a dynamical stochastic process inspired by biological reproduction dynamics, namely by deleting and duplicating existing nodes and their edges. The properties of the degree distribution in the stationary state is analysed by use of the Fokker–Planck equation. For a broad range of parameters, exponential degree distributions are observed. The mechanism responsible for this behaviour is illuminated by use of a simple mean field equation and reproduced by the Fokker–Planck equation. The latter is treated exactly, except for an approximate treatment of the degree–degree correlations. In the limit of 0 mutations, the degree distribution becomes a power law with exponent 1.
Collapse
Affiliation(s)
- Henrik Jeldtoft Jensen
- Institute for Mathematical Sciences, Imperial College London53 Prince's Gate, South Kensington campus, London SW7 2PG, UK
- Department of Mathematics, Imperial College LondonSouth Kensington campus, London SW7 2AZ, UK
| |
Collapse
|
31
|
|
32
|
Waclaw B, Sokolov IM. Finite-size effects in Barabási-Albert growing networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:056114. [PMID: 17677140 DOI: 10.1103/physreve.75.056114] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/09/2006] [Indexed: 05/16/2023]
Abstract
We investigate the influence of the network's size on the degree distribution pi k in Barabási-Albert model of growing network with initial attractiveness. Our approach based on moments of pi k allows us to treat analytically several variants of the model and to calculate the cutoff function, giving finite-size corrections to pi k. We study the effect of initial configuration as well as of addition of more than one link per time step. The results indicate that asymptotic properties of the cutoff depend only on the exponent gamma in the power-law describing the tail of the degree distribution. The method presented in this paper is very general and can be applied to other growing networks.
Collapse
Affiliation(s)
- B Waclaw
- Marian Smoluchowski Institute of Physics, Jagellonian University, Reymonta 4, 30-059 Kraków, Poland.
| | | |
Collapse
|
33
|
Yoon S, Lee S, Yook SH, Kim Y. Statistical properties of sampled networks by random walks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:046114. [PMID: 17500968 DOI: 10.1103/physreve.75.046114] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/04/2006] [Indexed: 05/15/2023]
Abstract
We study the statistical properties of the sampled networks by a random walker. We compare topological properties of the sampled networks such as degree distribution, degree-degree correlation, and clustering coefficient with those of the original networks. From the numerical results, we find that most of topological properties of the sampled networks are almost the same as those of the original networks for gamma les approximately <3. In contrast, we find that the degree distribution exponent of the sampled networks for gamma>3 somewhat deviates from that of the original networks when the ratio of the sampled network size to the original network size becomes smaller. We also apply the sampling method to various real networks such as collaboration of movie actor, Worldwide Web, and peer-to-peer networks. All topological properties of the sampled networks are essentially the same as those of the original real networks.
Collapse
Affiliation(s)
- Sooyeon Yoon
- Department of Physics and Research Institute for Basic Sciences, Kyung Hee University, Seoul 130-701, Korea
| | | | | | | |
Collapse
|
34
|
Costa LDF, Travieso G. Exploring complex networks through random walks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:016102. [PMID: 17358219 DOI: 10.1103/physreve.75.016102] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/15/2006] [Revised: 11/09/2006] [Indexed: 05/14/2023]
Abstract
Most real complex networks--such as protein interactions, social contacts, and the Internet--are only partially known and available to us. While the process of exploring such networks in many cases resembles a random walk, it becomes a key issue to investigate and characterize how effectively the nodes and edges of such networks can be covered by different strategies. At the same time, it is critically important to infer how well can topological measurements such as the average node degree and average clustering coefficient be estimated during such network explorations. The present article addresses these problems by considering random, Barabási-Albert (BA), and geographical network models with varying connectivity explored by three types of random walks: traditional, preferential to untracked edges, and preferential to unvisited nodes. A series of relevant results are obtained, including the fact that networks of the three studied models with the same size and average node degree allow similar node and edge coverage efficiency, the identification of linear scaling with the size of the network of the random walk step at which a given percentage of the nodes/edges is covered, and the critical result that the estimation of the averaged node degree and clustering coefficient by random walks on BA networks often leads to heavily biased results. Many are the theoretical and practical implications of such results.
Collapse
Affiliation(s)
- Luciano da Fontoura Costa
- Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, São Carlos, Sao Paulo, 13560-970, Brazil.
| | | |
Collapse
|
35
|
The effects of incomplete protein interaction data on structural and evolutionary inferences. BMC Biol 2006; 4:39. [PMID: 17081312 PMCID: PMC1665463 DOI: 10.1186/1741-7007-4-39] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2006] [Accepted: 11/03/2006] [Indexed: 11/24/2022] Open
Abstract
Background Present protein interaction network data sets include only interactions among subsets of the proteins in an organism. Previously this has been ignored, but in principle any global network analysis that only looks at partial data may be biased. Here we demonstrate the need to consider network sampling properties explicitly and from the outset in any analysis. Results Here we study how properties of the yeast protein interaction network are affected by random and non-random sampling schemes using a range of different network statistics. Effects are shown to be independent of the inherent noise in protein interaction data. The effects of the incomplete nature of network data become very noticeable, especially for so-called network motifs. We also consider the effect of incomplete network data on functional and evolutionary inferences. Conclusion Crucially, when only small, partial network data sets are considered, bias is virtually inevitable. Given the scope of effects considered here, previous analyses may have to be carefully reassessed: ignoring the fact that present network data are incomplete will severely affect our ability to understand biological systems.
Collapse
|
36
|
Abstract
The analysis of molecular networks, such as transcriptional, metabolic and protein interaction networks, has progressed substantially because of the power of models from statistical physics. Increasingly, the data are becoming so detailed--though not always complete or correct--that the simple models are reaching the limits of their usefulness. Here, we will discuss how network information can be described and to some extent quantified. In particular statistics offers a range of tools, such as model selection, which have not yet been widely applied in the analysis of biological networks. We will also outline a number of present challenges posed by biological network data in systems biology, and the extent to which these can be addressed by new developments in statistics, physics and applied mathematics.
Collapse
|
37
|
Abstract
In this paper, we discuss statistical families
with the property that if the distribution of a random variable
X
is in
, then so is the distribution of
Z
∼Bi(
X
,
p
) for 0≤
p
≤1. (Here we take
Z
∼Bi(
X
,
p
) to mean that given
X
=
x
,
Z
is a draw from the binomial distribution Bi(
x
,
p
).) It is said that the family is closed under binomial subsampling. We characterize such families in terms of probability generating functions and for families with finite moments of all orders we give a necessary and sufficient condition for the family to be closed under binomial subsampling. The results are illustrated with power series and other examples, and related to examples from mathematical biology. Finally, some issues concerning inference are discussed.
Collapse
Affiliation(s)
- Carsten Wiuf
- Bioinformatics Research Center, University of AarhusHøegh Guldbergsgade 10, 8000 Aarhus C, Denmark
| | - Michael P.H Stumpf
- Division of Molecular Biosciences, Imperial College LondonWolfson Building, London SW7 2AZ, UK
| |
Collapse
|
38
|
Lee SH, Kim PJ, Jeong H. Statistical properties of sampled networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2006; 73:016102. [PMID: 16486211 DOI: 10.1103/physreve.73.016102] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/10/2005] [Revised: 10/05/2005] [Indexed: 05/06/2023]
Abstract
We study the statistical properties of the sampled scale-free networks, deeply related to the proper identification of various real-world networks. We exploit three methods of sampling and investigate the topological properties such as degree and betweenness centrality distribution, average path length, assortativity, and clustering coefficient of sampled networks compared with those of original networks. It is found that the quantities related to those properties in sampled networks appear to be estimated quite differently for each sampling method. We explain why such a biased estimation of quantities would emerge from the sampling procedure and give appropriate criteria for each sampling method to prevent the quantities from being overestimated or underestimated.
Collapse
Affiliation(s)
- Sang Hoon Lee
- Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea.
| | | | | |
Collapse
|