On Sampling Strategies for Large Graphs

Contributed by: Sandra Mitrović, Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!


Social Network Analysis (SNA) has recently been gaining more and more popularity in various domains. Unfortunately, performing SNA is not always an easy task. Not because it is difficult to construct corresponding network (although, this might, as well, happen in some cases), but since, due to the volume of data which translates to huge network/graph, it is very time consuming and computationally expensive to perform analysis on these graphs. Depending on the type of task, handling graphs with even just dozens of thousands of nodes can be unfeasible, as some studies show. An intuitive solution to deal with this situation, just as in any scenario where we have a massive amount of data, is to sample the graph and then perform relevant simulation/analysis on obtained sub-graph.

The main question here is then: how to sample – that is: what sampling strategy to choose?

To tackle this question, one should first decide which of the three typical sampling approaches to follow:

  1. randomly selecting nodes;
  2. randomly selecting edges;
  3. an exploration (or traversal) approach.

With the node-based sampling, first the nodes are selected based on some criteria and then the edges whose both nodes are already selected, are also included in the sample. Nodes can be selected uniformly (known as Random Node sampling), or proportionally, according to some criteria, for example: node degree (known as Random Degree Node sampling), node Page Rank weights (known as Random Page Rank Node sampling).

In the edge-based approach, procedure is reversed to previously explained: first the edges are selected and then the nodes, which those edges connect, are added. Obviously, the selection can again be done in a uniform or non-uniform manner, but the problems here arise very often, since the obtained sub-graphs are usually very sparse and with large diameters. Hence, random edge sampling algorithm is very frequently combined with some other, typically again node-based sampling, thus forming some kind of hybrid sampling approach. For example, one can alternate between a step of node sampling with probability p and a step of edge sampling with probability (1-p).

Exploration or traversal (also called topology-based) approaches are based on the idea of randomly selecting one node and then exploring its neighborhood. Exploration sampling algorithms are numerous, some of the most known and frequently used once are: Breath First Sampling, Depth First Sampling, Random Walk sampling, Metropolis-Hastings Random Walk sampling, Multi-Dimensional Random Walk sampling (also known as Frontier Sampling), Forest Fire sampling. Let us assume that we have randomly chosen a seed node u. With Breath First Sampling, the next step would be to randomly select k nodes among the direct neighbors of node u. With Depth First Sampling, we would next select one node v among the direct (1st level) neighbors of node u, then one node w among the 1st level neighbors of node v (hence 2nd level neighbors of node u) etc. till we select all k nodes. In Random Walk sampling, we would simulate random walk starting from node u, which would mean that either with probability p (usually p=0.85) the next node v would be selected among the neighbors of node u according to the transition probability p’ from u to v, or, with probability 1-p, we would jump to any arbitrary node t (potentially also back to node u) and restart the random walk. So far mentioned approaches make samples biased to high-degree nodes. Metropolis-Hastings Random Walk sampling allows correcting this bias and performs uniform sampling. In this approach, starting from node u, we would simulate Markov Chain Monte Carlo (MCMC) sampling algorithm, that is, adjust the probability of accepting node’s u neighbor node v, so that the resulting random walk converges to desired (uniform) distribution. Frontier Sampling starts from m randomly chosen nodes and performs m dependent random walks in the graph. It was shown that it performs better than Random Walk sampling and that both Metropolis-Hastings Random Walk sampling and Frontier sampling perform better in tightly connected networks. In the Forest Fire sampling, starting from node u, the idea is to ‘burn’ the links to a fraction of its neighbors according to the geometric distribution and then to repeat the process recursively for each one of these neighbors until reaching sufficient number of ‘burned’ nodes.

As it can be seen, with each of these approaches, the sample size has to be chosen in advance and determining the adequate sample size is another important aspect of sampling.

Sampling can also be done with the objective of preserving network properties. A simple example of sampling based on particular node property (hence, in this case, a node-based sampling) can be found in social sciences where a sub-population with particular characteristics has to be examined. However, this most trivial scenario is often very rare, and instead, a sampling preserving one or more graph properties has to be performed. Making a decision on which exactly graph properties should be maintained is typically not straightforward, but in some cases it might be possible to conclude that particular properties should be retained, for example, in- and out-degree distributions. The most common situation (and in the same time, most complex one) is, however, when we do not exactly know which features should be preserved. Instead, the aim is to have the sub-graph whose properties relate very closely to the ones of a referent graph. The referent graph can be our original, large graph (in which case we are talking about so-called Scale-down goal), but it can also, as well, be a past snapshot of our graph (so-called Back-in-time goal). This can be very interesting for large graphs evolving in time and/or size and obviously the snapshot can be determined by fixing certain moment in time T or particular size S.

Evaluating performed sampling is yet another question which also requires thorough and separate elaboration.

In the end, let us mention that literature has seen different purposes of graph sampling, other than the sampling for performing simulations and further analysis on sub-graphs. For example, graph sampling is also used with the aim to facilitate visualization, for outlier detection, to enable more efficient storage and retrieval of web-graphs etc.

References

  • Leskovec, Jure and Faloutsos, Christos. “Sampling from Large Graphs”, KDD ’06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, p. 631-636
  • Ribeiro, Bruno, and Towsley. Don. “Estimating and sampling graphs with multidimensional random walks”, Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. ACM, 2010.
  • Wang, Tianyi, et al. “Understanding graph sampling algorithms for social network analysis”, 31st International Conference on Distributed Computing Systems Workshops. IEEE, 2011.