It’s the Network, you Stupid!

Posted on July 4, 2015

By: Bart Baesens, Véronique Van Vlasselaer, Wouter Verbeke, Seppe vanden Broucke

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.

It is estimated that, each year, a typical organization loses about 5% of its revenues to fraud. For instance, the total cost of non-health insurance fraud in the U.S. is estimated to be more than $40 billion per year (www.fbi.gov). Other well-known examples are credit card fraud, money laundering, healthcare fraud, telecommunications fraud, click fraud, tax evasion, and counterfeiting. Given today’s fiercely competitive environment, narrowing margins and increasing shareholder pressure, more and more firms consider fraud detection and prevention a key strategic priority. To continuously outsmart fraudsters, who keep perfecting their tactics with newly emerging technology, firms need new, efficient and effective tools. Since data is becoming available in abundance and at low cost, many firms are investing in massive big data storage platforms (based on e.g. Hadoop). These untapped data reservoirs offer enormous potential for analytics to learn which mechanisms fraudsters are using.

Recently, we engaged in various active research partnerships to study how analytics can be used to detect both tax evasion and credit card fraud. In this short article, we would like to share some of our key research findings.

Successful fraud analytical models should satisfy various requirements. First, they should achieve good statistical performance in terms of recall or hit rate, which is the percentage of actual fraudsters that the analytical model labels as suspicious, and precision, which is the percentage of actual fraudsters among those labeled as suspicious. Next, the analytical models should not be based on complex mathematical formulas (such as neural networks, support vector machines, …) but should provide clear insight into the fraud mechanisms adopted. This is particularly important since the insights gained will be used to develop new fraud prevention strategies. Finally, the operational efficiency of the fraud analytical model needs to be evaluated. This refers to the amount of resources needed to calculate the fraud score and adequately act upon it. For example, in a credit card fraud environment, a decision needs to be made within a few seconds after the transaction is initiated.
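To make the two performance measures concrete, here is a minimal sketch in Python. The labels are made up for illustration and the function name `precision_recall` is our own, not taken from any library.

```python
def precision_recall(actual_fraud, flagged):
    """Return (precision, recall) for two parallel lists of booleans.

    actual_fraud[i] is True if case i really is fraud;
    flagged[i] is True if the model labeled case i as suspicious.
    """
    true_positives = sum(1 for a, f in zip(actual_fraud, flagged) if a and f)
    n_flagged = sum(flagged)
    n_fraud = sum(actual_fraud)
    # Guard against division by zero when nothing is flagged or nothing is fraud.
    precision = true_positives / n_flagged if n_flagged else 0.0
    recall = true_positives / n_fraud if n_fraud else 0.0
    return precision, recall


# Hypothetical data: 8 cases, 4 of them fraudulent, 3 flagged by the model.
actual = [True, True, True, True, False, False, False, False]
flagged = [True, True, False, False, True, False, False, False]

precision, recall = precision_recall(actual, flagged)
print(f"precision={precision:.2f}, recall (hit rate)={recall:.2f}")
```

In this toy example two of the three flagged cases are fraudulent (precision of about 0.67), while only two of the four fraudsters are caught (recall of 0.50).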

Given all the aforementioned requirements, it is rather obvious that simple analytical techniques (e.g. regression based) should be adopted to detect fraud. To make sure that these models give the best performance possible, they should be fed with the best data available, so let’s zoom in on the data aspect in some more detail. Many analytical models assume that observations are independent and identically distributed (the well-known IID assumption in statistics), i.e., that one customer’s behavior does not depend on another’s. Throughout our research in fraud detection, we have consistently found this assumption to be false! Fraud is a social phenomenon and fraudsters often collaborate to set up complex fraudulent patterns (e.g. collusion). This principle is often referred to as homophily: people tend to associate with those who are like themselves or, in our context, fraudsters are likely to be connected (in some way) to other fraudsters. Hence, to fully exploit the data available, we should start thinking about defining networks between the various data observations. A network consists of nodes and edges, where the edges represent relationships between the nodes. A first example is corporate tax evasion, whereby the nodes correspond to companies and the edges to relationships between them based on shared resources such as infrastructure, employees, equipment, etc. Another example is a credit card fraud detection setting, where two types of nodes can be distinguished: credit cards and merchants (academics would call this a bipartite graph). A link between a credit card and a merchant then corresponds to a transaction.
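As an illustration of the credit card example, the sketch below builds such a bipartite network from a handful of made-up transactions using only the Python standard library. The identifiers and helper names (`build_bipartite`, `cards_sharing_merchant`) are hypothetical; a real project might rely on a dedicated graph library instead.

```python
from collections import defaultdict

# Hypothetical transactions: each one links a credit card to a merchant.
transactions = [
    ("card_A", "merchant_1"),
    ("card_A", "merchant_2"),
    ("card_B", "merchant_2"),
    ("card_C", "merchant_3"),
]


def build_bipartite(transactions):
    """Store the bipartite graph as two adjacency maps, one per node type."""
    card_to_merchants = defaultdict(set)
    merchant_to_cards = defaultdict(set)
    for card, merchant in transactions:
        card_to_merchants[card].add(merchant)
        merchant_to_cards[merchant].add(card)
    return card_to_merchants, merchant_to_cards


def cards_sharing_merchant(card, card_to_merchants, merchant_to_cards):
    """Cards two hops away: they transacted at a merchant this card also used."""
    linked = set()
    for merchant in card_to_merchants.get(card, ()):
        linked |= merchant_to_cards[merchant]
    linked.discard(card)  # a card is not its own neighbour
    return linked


card_to_merchants, merchant_to_cards = build_bipartite(transactions)
neighbours = cards_sharing_merchant("card_A", card_to_merchants, merchant_to_cards)
print(neighbours)
```

Because edges only ever run between a card and a merchant, two cards are related indirectly, through the merchants they have in common.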

Networks are important, but a key challenge concerns their definition, which depends heavily on the fraud setting you are working in. Ideally, a business expert (e.g. a fraud analyst) should provide some starting insights about what might be interesting to consider as nodes and edges. Depending upon the application area, different types of nodes can be distinguished (e.g. customer, claim, car, mobile, car repair shop, …). As said, the edges represent relationships and can be weighted, e.g. based upon the recency of the relationship, whereby more recent connections receive a higher weight. Once the network has been properly defined, it can be visually explored and statistically tested for homophily. If homophily is detected, the network can be featurized. The idea here is to create features summarizing the key characteristics of the network, and add those to the data for analytical modeling. A simple example of a feature is the number of fraudulent customers that a given customer is connected to. The more network features that can be added to the data, the better, since many analytical techniques have built-in facilities (e.g. variable selection) to detect which ones are important. Essentially, featurization is a data enrichment operation, which allows the analytical technique (e.g. regression) to disentangle complex, network-based fraud patterns.
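The sketch below illustrates both steps on a tiny made-up network. It rests on several assumptions of our own: the edge list and fraud labels are invented, the exponential recency weight with a 90-day half-life is just one possible weighting scheme, and the homophily measure is a simple descriptive ratio (observed fraud-to-fraud edges divided by the number expected if labels were spread at random), not a formal statistical test.

```python
# Hypothetical undirected edges: (node, node, days since the link was last observed).
edges = [
    ("A", "B", 10),
    ("A", "C", 200),
    ("B", "C", 30),
    ("C", "D", 5),
    ("D", "E", 400),
]
known_fraud = {"B", "C"}  # assumed labels, e.g. from past investigations


def recency_weight(days_ago, half_life_days=90.0):
    """Assumed scheme: weight halves every half_life_days."""
    return 0.5 ** (days_ago / half_life_days)


def featurize(edges, known_fraud):
    """Per-node network features to append to the modeling data set."""
    features = {}
    for u, v, days_ago in edges:
        weight = recency_weight(days_ago)
        for node, other in ((u, v), (v, u)):
            f = features.setdefault(
                node,
                {"degree": 0, "fraud_neighbours": 0, "weighted_fraud_exposure": 0.0},
            )
            f["degree"] += 1
            if other in known_fraud:
                f["fraud_neighbours"] += 1
                f["weighted_fraud_exposure"] += weight
    return features


def fraud_edge_ratio(edges, known_fraud):
    """Observed fraud-fraud edges over the count expected under random labels.

    Values well above 1 hint at homophily among fraudsters.
    """
    nodes = {n for u, v, _ in edges for n in (u, v)}
    n, m = len(nodes), len(edges)
    n_fraud = len(nodes & known_fraud)
    observed = sum(1 for u, v, _ in edges if u in known_fraud and v in known_fraud)
    expected = m * (n_fraud * (n_fraud - 1) / 2) / (n * (n - 1) / 2)
    return observed / expected if expected else float("nan")


features = featurize(edges, known_fraud)
homophily_ratio = fraud_edge_ratio(edges, known_fraud)
for node in sorted(features):
    print(node, features[node])
print("fraud-edge ratio:", homophily_ratio)
```

Each row of `features` can then be joined to the customer-level data before fitting, for example, a regression model. Note that the fraud labels used to build the features should predate the cases being scored, otherwise the features leak the very outcome the model is meant to predict.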

To summarize this short article, we would like to reiterate its key takeaways:

Analytical fraud models should have high hit rates and precision, good interpretability and operational efficiency;
Fraud is a social phenomenon; fraudsters often act in collaboration with other fraudsters;
Networks are important to disentangle complex fraud patterns;
Featurization is a data enrichment operation whereby network features are added to the data for improved analytical modeling.

We welcome any shared experiences (both confirming and contradicting).

For more on fraud analytics, see:

Baesens B., Van Vlasselaer V., Verbeke W., Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection, Wiley, ISBN 1119133122, 2015.