Big Data for Credit Scoring: Opportunities and Challenges

Contributed by: Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.


Over the past few decades, banks have gathered plenty of information describing the default behavior of their customers.  Examples include historical information about a customer’s date of birth, gender, income, employment status, etc.  All this data has been stored in huge (e.g. relational) databases or data warehouses.  On top of this, banks have accumulated a lot of business experience with their credit products.  As an example, many credit experts do a pretty good job of discriminating between low-risk and high-risk mortgages using their business expertise alone.  The aim of credit scoring is to analyze both sources of information in more detail and come up with a statistically based decision model that makes it possible to score future credit applications and ultimately decide which ones to accept or reject.
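
To make the idea of a statistically based decision model concrete, here is a minimal sketch in Python.  It assumes a logistic regression scorecard, which is a common choice but is not prescribed by this article.  The coefficients, feature names and the 20% cutoff are all hypothetical and serve only as an illustration; a real scorecard would estimate them from historical default data.

```python
import math

# Hypothetical coefficients, for illustration only. A real scorecard would
# estimate these from historical default data.
INTERCEPT = -1.0
WEIGHTS = {
    "income_thousands": -0.03,    # higher income lowers the estimated risk
    "years_employed": -0.10,      # longer employment lowers the estimated risk
    "past_delinquencies": 0.80,   # earlier payment problems raise it
}


def probability_of_default(applicant):
    """Map an applicant's features to a default probability (logistic model)."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(applicant, cutoff=0.20):
    """Accept when the estimated default probability is below the cutoff."""
    return "accept" if probability_of_default(applicant) < cutoff else "reject"


# Two made-up applicants.
low_risk = {"income_thousands": 60, "years_employed": 8, "past_delinquencies": 0}
high_risk = {"income_thousands": 20, "years_employed": 0, "past_delinquencies": 3}

print(decide(low_risk), decide(high_risk))
```

In practice the cutoff reflects the bank's risk appetite, and the expert knowledge mentioned above is typically used to check whether the estimated coefficients make business sense.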

The emergence of big data has created both opportunities and challenges for credit scoring.  Big data is often characterized in terms of its 4 V’s: Volume, Variety, Velocity and Veracity.  To illustrate this, let’s briefly zoom in on some key sources or processes generating big data.  Traditional sources are large-scale transactional enterprise systems such as OLTP (On-Line Transaction Processing), ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management) applications.  Classical credit scorecards are typically constructed using data extracted from these traditional transactional systems.

The online social graph is a more recent example.  Think about the major social networks such as Facebook, Twitter, LinkedIn, Weibo, and WeChat.  Together, these networks capture information about close to two billion people, including their friends, preferences and other behavior, thereby leaving a massive digital trail of data.  Also think about the Internet of Things (IoT), the emerging sensor-enabled ecosystem that is going to connect various objects (e.g. homes, cars, etc.) with each other and with humans.  Finally, we see more and more open or public data, such as data about weather, traffic, maps, the macro-economy, etc.  It is clear that all these new data sources offer tremendous potential for building better credit scoring models.

All the above data-generating processes can be characterized in terms of the sheer volume of data being generated.  Clearly, this poses serious challenges in terms of setting up scalable storage architectures combined with a distributed approach to data manipulation and querying.

Big data usually comes in a great variety of formats.  Traditional data types or structured data such as customer name, customer birth date, etc. are more and more complemented with unstructured data such as images, fingerprints, tweets, emails, Facebook pages, sensor data, GPS data, etc.  Although the former can be easily stored in traditional (e.g. relational) databases, the latter needs to be accommodated using appropriate database technology that facilitates the storage, querying and manipulation of each of these types of unstructured data.  Here too, a substantial effort is required, since it is claimed that at least 80% of all data is unstructured.

Velocity refers to the speed at which the data is generated and needs to be stored and analyzed.  Think about streaming applications such as on-line trading platforms, YouTube, SMS messages, credit card swipes, phone calls, etc., all of which are examples where high velocity is a key concern.

Veracity indicates the quality or trustworthiness of the data.  Unfortunately, more data does not automatically imply better data, so the quality of the data generating process must be closely monitored and guaranteed.
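
As a rough illustration of what monitoring data quality could look like, the sketch below computes the share of records that miss a required field and flags fields above a threshold.  The records, field names and the 20% threshold are hypothetical; real monitoring would cover many more checks (plausibility, consistency, timeliness, and so on).

```python
def missing_rates(records, required_fields):
    """Return, per required field, the fraction of records where it is missing."""
    total = len(records)
    rates = {}
    for field in required_fields:
        # A field counts as missing when the key is absent or its value is None.
        missing = sum(1 for record in records if record.get(field) is None)
        rates[field] = missing / total if total else 0.0
    return rates


# Made-up applicant records, for illustration only.
records = [
    {"income": 42000, "employment_status": "employed"},
    {"income": None, "employment_status": "self-employed"},
    {"income": 38000},
    {"income": 51000, "employment_status": "employed"},
]

rates = missing_rates(records, ["income", "employment_status"])

# Hypothetical rule: flag any field that is missing in more than 20% of records.
flagged = [field for field, rate in rates.items() if rate > 0.20]
print(rates, flagged)
```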

As the volume, variety, velocity and veracity of data continue to grow, so do the opportunities for building better credit scoring models.  Think about Facebook or Twitter as an example.  Knowing a credit applicant’s hobbies, followers, friends, likes, education and workplace could be very beneficial for better quantifying his/her creditworthiness.  In other words, a customer’s social standing, on-line reputation and professional connections are likely to be related to his/her credit quality.  Another useful data source is call detail records or CDR data, which capture the mobile phone usage of an applicant.  Surfing behavior could also be a useful addition.

Clearly, the availability of these big data sources creates both opportunities and challenges for credit scoring.  For example, the availability of social network and CDR data may be beneficial in various settings.  First, it may be useful for scoring customers who lack borrowing experience (e.g. because it’s their first loan or they recently moved to a new country) and would automatically be perceived as risky by traditional credit scoring models, which rely on historical information.  By using these alternative data sources, a better assessment of the credit risk can be made, which can then be translated into a more favorable interest rate.  This gives the customer an incentive to disclose his/her social network, CDR or other relevant data to the bank.  Another example is developing countries.  In these countries, banks often lack historical credit information and local credit bureaus may not be available.  Hence, other data sources should be used to optimize access to credit.  Given the widespread use of social networks and/or mobile phones (even in developing countries!), the data gathered might be an interesting alternative basis for credit scoring.

Obviously, using the above-mentioned data sources also comes with various challenges.  The first one concerns privacy.  It is important that customers are properly informed about what data is used to calculate their credit score.  An opt-out option should always be provided.  Furthermore, using social network data for credit scoring can trigger new fraud behavior whereby customers strategically construct their social network to artificially and maliciously boost their apparent credit quality.  For example, customers can easily buy Twitter followers to improve their credit scores.  Finally, regulatory compliance may also become an important issue.  Many countries prohibit the use of gender, age, marital status, national origin, ethnicity and beliefs for credit scoring.  Much of this information can be easily scraped from social networks.  Hence, it may be harder to oversee regulatory compliance when using social network or other data for credit scoring.
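
As a simple illustration of the compliance point, the sketch below removes prohibited attributes from a set of scraped features before they reach a scoring model.  The attribute names and the example profile are hypothetical; the prohibited categories follow the list given above, and the actual list depends on the jurisdiction.

```python
# Attribute names are hypothetical; the categories follow the list in the text.
PROHIBITED = {
    "gender",
    "age",
    "marital_status",
    "national_origin",
    "ethnicity",
    "beliefs",
}


def strip_prohibited(features):
    """Return a copy of the features without the prohibited attributes."""
    return {name: value for name, value in features.items()
            if name not in PROHIBITED}


# Made-up profile scraped from a social network, for illustration only.
scraped = {
    "gender": "F",
    "age": 34,
    "marital_status": "single",
    "workplace": "Acme",
    "followers": 1200,
}

allowed = strip_prohibited(scraped)
print(allowed)
```

Note that filtering by attribute name does not remove proxies: the remaining fields may still correlate with a prohibited attribute, which is part of why compliance is harder to oversee with this kind of data.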