A (Good) Picture is Worth a Thousand Words: Visualizing Complex Data for Business Value with R

Posted on September 24, 2017

Contributed by: Jasmien Lismont, Sandra Mitrović, María Óskarsdóttir, Michael Reusens, Tine Van Calster, Jochen De Weerdt, Wilfried Lemahieu, Bart Baesens, Jan Vanthienen

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.

Although the focus of data mining and analytics is to discover patterns and gain insights from data, the presentation of observations and results in a meaningful and insightful way is equally important. This is especially the case in industry, where understanding models and their implications is essential for their deployment. However, as data grows in both volume and complexity, and models frequently become more performant by sacrificing understandability, challenges arise. This brief overview article focuses on several business-driven applications and how they deal with these challenges by taking advantage of several interesting R packages. Firstly, we present visualizations that are able to represent numerous observations or, alternatively, a high number of dimensions. Secondly, we deal with the challenge of understandability by showing several applications which help to explain insights from data and models in a less complex way.

Before data can be analyzed, it is important to properly explore the collected data. For this purpose, we highlight the likert package in combination with ggplot2 for analyzing survey responses on the popularity of different analytics techniques. As a preliminary step in understanding the data, one can also use the corrplot package to see whether, and to what extent, correlations exist between pairs of variables. Next, we turn to radar plots provided by the fmsb package for visualizing geolocation-dependent customer behaviour, which can for instance be useful in banking. Informative visualizations of networked data can be obtained by applying the novel ggraph package directly to igraph objects, which we have, for instance, implemented for a use case on turnover in employee networks. In another application, we have used the Rtsne package in combination with ggplot2 to generate embeddings that provide visual insights into job seekers’ features in a job recommendation setting.
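As a minimal sketch of the correlation exploration step mentioned above, the following R snippet computes a Pearson correlation matrix and draws it with corrplot. The built-in mtcars data set and the selected columns are assumptions made purely for illustration; they are not the data used in the applications described here. The plot is only drawn if corrplot is installed.

```r
# Assumption: the built-in mtcars data stands in for real business data.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise Pearson correlations between the selected numeric variables.
corr_matrix <- cor(num_vars, method = "pearson")

# Draw the upper triangle as circles, with the coefficients printed on top.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(
    corr_matrix,
    method = "circle",
    type = "upper",
    addCoef.col = "black",  # print the correlation coefficients
    tl.col = "black"        # colour of the variable labels
  )
}
```

Larger and darker circles indicate stronger correlations, which makes it easy to spot pairs of variables that carry largely the same information before any modelling starts.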

Secondly, as mentioned above, we can also concentrate on the illustration of model results. For example, the impact of variables in logistic regression models can be visualized with a colored nomogram created using the VRPM package. Similarly, we can try to better comprehend random forests, which are well-performing but black-box models by nature. Variable importance can be calculated by means of rfPermute and then visualized in ggplot2, e.g. by measuring the decrease in node impurity attributable to a particular variable. Finally, the visualization of large tree structures can be challenging, both in terms of memory use and in terms of actually gaining useful insights from this type of data. By combining the data.tree package with the networkD3 package, an interactive radial network can be built that can help detect peculiarities in hierarchical data, illustrated here with geographical time series, or help better understand the results of tree-based models that contain a large number of variables, such as decision trees or hierarchical clustering.
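To make the variable importance idea more concrete, here is a small sketch that plots the mean decrease in node impurity per variable with ggplot2. Note that it uses the randomForest package directly, rather than rfPermute, and that the built-in iris data set is an assumption chosen only to keep the example self-contained; neither reflects the actual setup behind the applications described here.

```r
library(randomForest)
library(ggplot2)

# Assumption: iris stands in for a real classification data set.
set.seed(42)
rf_model <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# type = 2 returns the mean decrease in node impurity
# (measured by the Gini index for classification forests).
imp <- importance(rf_model, type = 2)
imp_df <- data.frame(
  variable = rownames(imp),
  mean_decrease_gini = imp[, "MeanDecreaseGini"]
)

# Horizontal bar chart, ordered by importance.
p <- ggplot(imp_df, aes(x = reorder(variable, mean_decrease_gini),
                        y = mean_decrease_gini)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean decrease in Gini impurity")
print(p)
```

Ordering the bars by importance lets a business audience see at a glance which variables drive the forest's predictions, without having to inspect individual trees.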

In this brief abstract, we have shown several real-life applications of visualization techniques in R to illustrate how an image can help us understand large data volumes and varieties as well as complex modelling outcomes. A good picture will not only guide data scientists in their analysis, but will also assist analysts in bringing their results to the business and deriving real value from data.