Can you explain how the Jaccard index can be used for distance calculation?

By: Bart Baesens, Seppe vanden Broucke

This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Also want to submit your question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.

You asked: Can you explain how the Jaccard index can be used for distance calculation?

Our answer:

The Jaccard index is often used in insurance fraud detection methods, which are typically based on a series of red flag indicators used to label a claim as suspicious or not.

Assume we have the following data set with binary red flag indicators:

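|         | Poor driving record | Premium paid in cash | Car purchase information available | Coverage | Car was never inspected or seen |
|---------|---------------------|----------------------|------------------------------------|----------|---------------------------------|
| Claim 1 | Yes                 | No                   | Yes                                | Yes      | No                              |
| Claim 2 | Yes                 | Yes                  | No                                 | No       | No                              |
| …       | …                   | …                    | …                                  | …        | …                               |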

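A first way to quantify the similarity between claims 1 and 2 (and hence their distance, taken as one minus the similarity) is the simple matching coefficient (SMC). It is the proportion of variables on which both claims have the same value. Here the claims agree on two of the five indicators (one Yes-Yes match and one No-No match):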

SMC(Claim 1, Claim 2) = 2/5

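A tacit assumption behind the SMC is that both states of a variable (Yes versus No) are equally important and should thus both be considered. Another option is to use the Jaccard index, which leaves No-No matches out of the computation: it divides the number of Yes-Yes matches by the number of indicators for which at least one of the two claims has a Yes. Here that gives one Yes-Yes match out of four such indicators: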

Jaccard(Claim 1, Claim 2) = 1/4

The Jaccard index measures the similarity between both claims across those red flags that were raised at least once. It is especially useful in situations where many red flag indicators are available and typically only a few are raised. Consider, for example, a fraud detection system with 100 red flag indicators of which on average 5 are raised. If you used the simple matching coefficient, nearly all claims would appear very similar, since the No-No matches would dominate the count, and no meaningful clustering solution would emerge.
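
To make the effect concrete, here is a small sketch with three made-up claims in a system of 100 red flags, each raising exactly 5. The claims, the flag numbers and the helper names `smc` and `jaccard` are hypothetical and chosen only for illustration. Each claim is stored as the set of flag numbers it raises.

```python
N_FLAGS = 100  # total number of red flag indicators in the hypothetical system


def smc(raised_a, raised_b, n_flags=N_FLAGS):
    """Simple matching coefficient from the sets of raised flags."""
    mismatches = len(raised_a ^ raised_b)  # flags raised in exactly one claim
    return (n_flags - mismatches) / n_flags


def jaccard(raised_a, raised_b):
    """Jaccard index from the sets of raised flags."""
    union = raised_a | raised_b
    if not union:
        return 0.0  # assumed convention when no flag is raised at all
    return len(raised_a & raised_b) / len(union)


claim_a = {0, 1, 2, 3, 4}
claim_b = {5, 6, 7, 8, 9}    # shares no raised flag with claim_a
claim_c = {0, 1, 2, 3, 10}   # shares four raised flags with claim_a

print(smc(claim_a, claim_b), smc(claim_a, claim_c))  # 0.9 0.98
print(jaccard(claim_a, claim_b), round(jaccard(claim_a, claim_c), 2))  # 0.0 0.67
```

With the SMC, the claim that shares nothing with claim A still scores 0.9, barely below the 0.98 of the claim that shares four flags. The Jaccard index separates the two cases clearly.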

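The Jaccard index gives a better idea of claim similarity in such settings, which is why it has been very popular in fraud detection. To use it for distance calculation, take one minus the index: the Jaccard distance between claims 1 and 2 is 1 - 1/4 = 3/4.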