This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
Networked data arise in a wide variety of real-life situations and domains. Any collection of relationships between entities of any type represents a network. Regardless of the domain, the entities are typically called nodes and the links between them edges. Concrete examples include protein-protein interaction (PPI) networks in biology, transportation networks, social networks, retail networks (e.g. Amazon), and citation networks. This type of data is used in many applications, ranging from fraud detection and churn prediction to traffic optimization. However, gaining insights from networked data and fully exploiting its potential can be challenging. From gathering and structuring the data to building graphs and extracting information, the possibilities at each step of the process are abundant. In this abstract, we aim to provide an overview of this workflow as a whole, while discussing some of the possibilities available at each step and drawing special attention to R packages that can be used (independently or combined) for handling networked data.
Structuring the data for efficient storage and manipulation is the first step when working with networked data. The igraph package supports different structures for representing graphs, e.g. adjacency matrices or edge lists. However, large graphs are often sparse, which requires special attention: storing every zero of a mostly empty adjacency matrix wastes memory. In R, the Matrix package can generate sparse matrices, and the slam package can convert triplet representations into sparse matrices. Once the data structure is in order, we can look at graph topology. We can distinguish between unipartite, bipartite, and n-partite graphs, according to the number of node types (e.g. a bipartite graph of authors and papers). Multigraphs are another type of graph, in which the same pair of nodes can be connected by multiple (types of) edges. Because real-life problems typically require capturing not only the network topology but also characteristics of the nodes and/or edges, graphs are often enriched with additional attributes. Such networks are known as labeled networks. Depending on the topology and the final goal of the analysis, we can transform the graph, e.g. using the Matrix package to project a bipartite network onto a unipartite one, or add attributes such as edge weights using the igraph and sna packages.
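As a rough illustration of this structuring step, the sketch below goes from an edge list to a sparse matrix, projects a bipartite author-paper network onto authors, and stores the result as a weighted igraph object. The toy data and all variable names (`authorship`, `incidence`, `coauthor`) are made up for this example, and building the graph from the strict upper triangle is only one of several possible routes, not necessarily the one the authors have in mind.

```r
library(igraph)
library(Matrix)

# Hypothetical toy data: a bipartite edge list of who wrote which paper.
authorship <- data.frame(
  author = c("ann", "ann", "bob", "bob", "cat"),
  paper  = c("p1",  "p2",  "p1",  "p3",  "p3")
)

authors <- sort(unique(authorship$author))
papers  <- sort(unique(authorship$paper))

# Sparse author-by-paper incidence matrix; only non-zero cells are stored.
incidence <- sparseMatrix(
  i = match(authorship$author, authors),
  j = match(authorship$paper, papers),
  x = 1,
  dims = c(length(authors), length(papers)),
  dimnames = list(authors, papers)
)

# Bipartite -> unipartite projection: entry (a, b) counts the papers
# that authors a and b share.
coauthor <- tcrossprod(incidence)

# Keep the strict upper triangle (no diagonal, each pair once) and
# read it back as (i, j, x) triplets.
upper <- summary(triu(coauthor, k = 1))

# Weighted, undirected edge list; the third column becomes an edge attribute.
edges <- data.frame(
  from   = authors[upper$i],
  to     = authors[upper$j],
  weight = upper$x
)
g <- graph_from_data_frame(edges, directed = FALSE,
                           vertices = data.frame(name = authors))

print(edges)
```

Passing the full author list through `vertices` keeps authors without any co-author in the graph as isolated nodes, which an edge list alone would silently drop.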
Once the network is constructed, it can be used to gain new insights through different types of analysis. The first and most straightforward method is simply to visualize the graph, using igraph, ggraph or sna, for example to discover communities within the network. Second, we can extract network features, such as centrality measures (e.g. degree), which can be calculated using the igraph and sna packages, or features derived from node/edge attributes. These network-based features can then play a vital role in, e.g., classification applications. An igraph object with multiple node or edge attributes can easily be converted into a data.frame for further analysis. Third, network learning, such as predicting links and node labels, can be performed. Finally, the igraph package offers functions for graph sampling, which can be useful for large networks.
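The feature-extraction step might look roughly as follows in igraph. The toy network, the `group` attribute and the variable names are hypothetical. The sampling at the end is plain uniform node sampling via `induced_subgraph()`, chosen here as one simple option; it is an assumption of this sketch, not necessarily the sampling functions the text refers to.

```r
library(igraph)

# Hypothetical toy network: two triangles joined by a single bridge edge (c-d).
edges <- data.frame(
  from = c("a", "a", "b", "c", "d", "d", "e"),
  to   = c("b", "c", "c", "d", "e", "f", "f")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# A made-up node attribute, to show that attributes travel with the graph.
V(g)$group <- ifelse(V(g)$name %in% c("a", "b", "c"), "left", "right")

# Network features per node: degree and betweenness centrality.
V(g)$degree      <- degree(g)
V(g)$betweenness <- betweenness(g)

# Flatten all node attributes into a data.frame, e.g. as classifier input.
node_features <- as_data_frame(g, what = "vertices")
print(node_features)

# Uniformly sample four nodes and keep the subgraph they induce.
set.seed(42)
keep     <- sample(vcount(g), 4)
g_sample <- induced_subgraph(g, keep)
```

The resulting `node_features` table has one row per node and one column per attribute, so it can be joined with other tabular data before modelling.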
Networked data can be complex and cumbersome to work with. In this abstract we presented an overview of the process and possibilities when working with networks. Only when tackled appropriately will the networks show us what they are really made of!