Improving the ROI of Big Data and Analytics through Leveraging New Sources of Data

Contributed by: Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at and let’s get in touch!

Big Data and Analytics are all around these days.  Most companies already have their first analytical models in production and are thinking about further boosting their performance.  Far too often, they hereby focus on the analytical techniques rather than on the key ingredient: data!  We believe the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights.  In what follows, we briefly explore various types of data sources that could be worthwhile pursuing in order to squeeze more economic value out of your analytical models.

A first option concerns the exploration of network data by carefully studying relationships between customers.  These relationships can be explicit or implicit.  Examples of explicit networks are calls between customers, shared board members between firms, and social connections (e.g., family, friends ).  Explicit networks can be readily distilled from underlying data sources (e.g., call logs) and their key characteristics can then be summarized using featurization procedures resulting into new characteristics which can be added to the modeling data set.  In our previous research (Verbeke et al., 2014; Van Vlasselaer et al., 2017), we found network data to be highly predictive for both customer churn prediction and fraud detection.  Implicit networks or pseudo networks are a lot more challenging to define and featurize.  Martens and Provost (2016) built a network of customers where links were defined based upon which customers transferred money to the same entities (e.g., retailers) using data from a major bank.  When combined with non-network data, this innovative way of defining a network based upon similarity instead of explicit social connections gave a better lift and generated more profit for almost any targeting budget.  In another, award-winning study they built a geosimilarity network among users based upon location-visitation data in a mobile environment (Provost et al., 2015).  More specifically, two devices are considered similar and thus connected, when they share at least one visited location.  They are more similar if they have more shared locations and as these are visited by fewer people.  This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, or to improve online interactions by selecting users with similar tastes.  Both of these examples clearly illustrate the potential of implicit networks as an important data source.  A key challenge here is to creatively think about how to define these networks based upon the goal of the analysis.

Data are often branded as the new oil.  Hence, data pooling firms capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results thereof.  Popular examples are Equifax, Experian, Moody’s, S&P, Nielsen, and Dun & Bradstreet, among many others.  These firms consolidate publically available data, data scraped from websites or social media, survey data, and data contributed by other firms.  By doing so, they can perform all kinds of aggregated analyses (e.g., geographical distribution of credit default rates in a country, average churn rates across industry sectors), build generic scores (e.g., the FICO in the US) and sell these to interested parties.  Because of the low-entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms (e.g., SMEs) to take their first steps in analytics.  Besides commercially available external data, open data can also be a valuable source of external information.  Examples are industry and government data, weather data, news data, and search data (e.g., Google Trends).  Both commercial and open external data can significantly boost the performance and thus economic return of an analytical model.

Macro-economic data are another valuable source of information.  Many analytical models are developed using a snapshot of data at a particular moment in time.  This is obviously conditional on the external environment at that moment.  Macro-economic up- or down-turns can have a significant impact on the performance and thus ROI of the analytical model.  The state of the macro-economy can be summarized using measures such as gross domestic product (GDP), inflation and unemployment.  Incorporating these effects will allow us to further improve the performance of analytical models and make them more robust against external influences.

Textual data are also an interesting type of data to consider.  Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation.  Textual data are difficult to process analytically since they are unstructured and cannot be directly represented into a matrix format.  Moreover, these data depend upon the linguistic structure (e.g., type of language, relationship between words, negations, etc.) and are typically quite noisy data due to grammatical or spelling errors, synonyms and homographs.  However, they can contain very relevant information for your analytical modeling exercise.  Just as with network data (see above), it will be important to find ways to featurize text documents and combine it with your other structured data.  A popular way of doing this is by using a document term matrix indicating what terms (similar to variables) appear and how frequently in which documents (similar to observations).  It is clear that this matrix will be large and sparse.  Dimension reduction will thus be very important as the following activities illustrate:

  • represent every term in lower case (e.g., PRODUCT, Product, product become product)
  • remove terms which are uninformative such as stop words and articles (e.g., the product, a product, this product become product)
  • use synonym lists to map synonym terms to one single term (product, item, article become product)
  • stem all terms to their root (products, product become product)
  • remove terms that only occur in a single document

Even after the above activities have been performed, the number of dimensions may still be too big for practical analysis.  Singular Value Decomposition (SVD) offers a more advanced way to do dimension reduction (Meyer, 2000).  SVD works similar to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors (also called latent concepts) which are linear combinations of the original terms.  These reduced dimensions can then be added as new features to your existing, structured data set.

Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered as well.  To successfully leverage these types of data in your analytical models, it is of key importance to carefully think about creative ways of featurizing them.  When doing so, it is recommended that any accompanying metadata are taken into account; for example, not only the image itself might be relevant, but also who took it, where, and at what time.  This information could be very useful for fraud detection.

To summarize, we strongly believe that the best way to boost the performance and ROI of your analytical models is by investing in data first!  In this contribution, we gave some examples of alternative data sources which can contain valuable information about the behavior of your customers.


  • Martens D., Provost F., Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics, MIS Quarterly, Volume 40, Number 4, pp. 869-888, 2016.
  • Meyer C.D., Matrix Analysis and Applied Linear Algebra, SIAM, Philadelphia, 2000.
  • Provost F., Martens D., Murray A., Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design, Information Systems Research, Volume 26, Issue 2, pp. 243 – 265, 2015.
  • Van Vlasselaer V., Eliassi-Rad T., Akoglu L., Snoeck M., Baesens B., GOTCHA! Network-based Fraud Detection for Security Fraud, Management Science, forthcoming, 2017
  • Verbeke W., Martens D., Baesens B., Social network analysis for customer churn prediction, Applied Soft Computing, Volume 14, pp. 341-446, 2014.