On Data Variety

Contributed by: Jasmien Lismont, Jan Vanthienen, Wilfried Lemahieu, Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.

Recently, the FBI fought Apple for access to data contained in the iPhone of a shooting suspect. In the end, the bureau managed to acquire the data itself by applying a third-party forensics tool. The whole dispute caused a lot of discussion with regard to data privacy, but it also illustrated once again how important data is in our current society.

What is data? According to Ackoff [1], data are “symbols that represent the properties of objects and events”. In a more recent paper on Open Data, Colpaert and his colleagues [2] discuss data as triples containing an entity, an attribute and a value, e.g. <DataMiningApps> <is located in> <Leuven>. It seems that our definition of data has not changed very much. But the data itself has, and so have its applications. The current Big Data hype illustrates this. Big Data is frequently described by means of three Vs: volume, velocity, and variety. Essentially, this says that data is coming in larger volumes, at a higher speed and in a larger variety of types. For example, imagine yourself browsing through Amazon: you create a large amount of clickstream data in only a few minutes, leave text data in the form of reviews, and receive up-to-date recommendations from Amazon.
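To make the triple notation quoted above concrete, here is a minimal Python sketch. The `Triple` type and the `values_of` helper are hypothetical names introduced only for illustration; they are not taken from Colpaert et al. [2], and real triple stores are considerably richer than this.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One piece of data: an entity, an attribute and a value."""
    entity: str
    attribute: str
    value: str


# The first triple is the example from the text; the second is an
# illustrative assumption added to show that one entity can carry
# several attributes.
triples = [
    Triple("DataMiningApps", "is located in", "Leuven"),
    Triple("DataMiningApps", "publishes", "Data Science Briefings"),
]


def values_of(data: List[Triple], entity: str, attribute: str) -> List[str]:
    """Return every value stored for the given entity and attribute."""
    return [t.value for t in data
            if t.entity == entity and t.attribute == attribute]


print(values_of(triples, "DataMiningApps", "is located in"))
```

Because every fact has the same three-part shape, new kinds of data can be added without changing the structure, which is one reason the definition of data has stayed so stable while its variety has grown.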

In this contribution we’d like to focus on variety, which has received less attention than the other Big Data characteristics.

Typical marketing campaigns (used to) focus mainly on sociodemographic data, collected by means of surveys or bought from external companies such as Nielsen or GfK. However, as early as 1996, Rossi and his colleagues [3] showed how valuable purchase or transactional data are in terms of performance. In order to stay competitive, companies thus need to expand the data types they use for their churn prediction, recommender systems, customer segmentation, credit risk analytics, etc.

Which data types are out there, and has their value changed in the current landscape? Our research has taught us that sociodemographic data are still frequently collected and used for analytics purposes. Purchase data is now also more commonly collected and used, although this type of data is more difficult to collect [4]. In addition, many new data types have emerged, e.g. social network data, environmental data, geographic data, clickstream data, text data, video data, etc. And although companies currently often employ a policy of storing as much data as possible (the term data lake is popping up more and more frequently), they certainly do not always (know how to) use it for analytics purposes.

There are three general trends that urge companies to dive into these new, rather unexplored data types. Firstly, companies that are active in saturated industries, such as telecommunications, are fighting for margins, which means that every customer counts. Improved performance of analytical models, such as churn prediction models, is thus key. One possible way to improve performance further is to reach out to new data types, for example by performing graph mining on social network data, analyzing customer complaints, or employing geographical data in churn predictions.

Secondly, whereas categories such as male/female and family with or without kids/single could once describe a person perfectly well, this is no longer the case. Some variables can no longer be used in certain settings (see, e.g., anti-discrimination laws in credit scoring) and, moreover, individuals are becoming more diverse. Blended families, globalization and migration trends, and gender blurring are only a few examples that make it harder to define persons by means of such variables. This is where purchase data, social network data, etc. can provide a more objective and more correct view. Companies can also turn to sentiment analysis of click, text and video streams to discover their customers’ true opinions.

Thirdly, digitization and automation are changing our landscape and creating an enormous number of possibilities. Take Google, for example. Based on your user profile demographics, click behavior and location data, Google can adapt its advertisement strategies. Digitization is thus providing the data for the new analytics techniques that the data scientists among us are creating. Besides digitization, automation is also increasing data volume, velocity and variety. Hospitals can track patients by means of RFID and improve their processes. Automated production processes can be analyzed by means of their tag data. Customers can scan QR codes in order to select and pay for their products, creating marketing analytics opportunities.

In the end, regardless of how one looks at data, its importance for analytics-driven insights is indisputable.


  • [1] Ackoff, R. L. 1989. “From Data to Wisdom,” Journal of Applied Systems Analysis (16:1), pp. 3-9.
  • [2] Colpaert, P., Van Compernolle, M., De Vocht, L. et al. 2014. “Quantifying the Interoperability of Open Government Datasets,” Computer (47:10), pp. 50-56.
  • [3] Rossi, P. E., McCulloch, R. E., Allenby, G. M. 1996. “The Value of Purchase History Data in Target Marketing,” Marketing Science (15:4), pp. 321-340.
  • [4] Davenport, T. H., Mule, L. D., Lucker, J. 2011. “Know What Your Customers Want Before They Do,” Harvard Business Review (89:12), pp. 84-92.