Towards AI Can Help your Team Adopt AI: Corporate Training, Consulting, and Talent Solutions.


Insurance Fraud Detection with Graph Analytics

Insurance Fraud Detection with Graph Analytics

Last Updated on July 31, 2022 by Editorial Team

Author(s): Tony Zhang

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

Explore graph database and extract graph features to augment the machine learning model

Photo by Conny Schneider on Unsplash

Insurance fraud is an enormous problem, and the insurance industry has been fighting fraud for a very long time. Let’s see the following headline new and statistics:

Insurance fraud costs U.S. consumers at least $80 billion every year, according to The Coalition Against Insurance Fraud (CAIF). CAIF also estimates that workers’ compensation insurance fraud alone costs insurers and employers $30 billion a year. The FBI reports that non-health insurance portion of the overall problem amounts to $40 billion annually. Ultimately, the massive scope of the problem can mean the average U.S. family will pay an extra $400 and $700 per year in premiums.

Health insurance fraud is a type of white-collar activity wherein dishonest claims are filed to gain a profit. The fraudulent activity may involve peers of providers, physicians, and beneficiaries acting together to make fraud claims. Hence, it is difficult to detect fraudulent activities given the complexity of the parties involved. The anomalous behavior could be a fraud ring where providers, doctors, and clients file small claims over time. Such small claims can be difficult to detect with traditional tools, which look at the claims individually and cannot uncover the connections between the various players.

A graph database is designed to fill the gap by storing and handling the highly connected data where the relationship is as important as the individual data point. Hence, we can utilize graph analytics to understand the relationships. In this article, you will learn :

  • Graph basic
  • Graph search and query to understand the relationships
  • Augment machine learning model with graph features

Graph basic

What is graph?

In mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense “related”. The objects correspond to mathematical abstractions called vertices (also called nodes or points) and each of the related pairs of vertices is called an edge (also called link or line).

Graph concept, image by author

Simply put, a graph is a mathematical representation of any type of network with:

  • Vertex, sometimes called a node, in the insurance context, can be:
    – Claim
    – Policyholder
  • Edge is the relationship/interaction/behavior between the nodes:
    – Policyholder of claim
    – Insured of claim

Graph algorithm concepts will be introduced in the third chapter — Augment machine learning model with graph features.

Graph search and query

Graph database is purpose-built to store and navigate the relationships utilizing the connectivity between vertices. End users don’t need to execute countless joins as the graph query language is about pattern matching between nodes. This is more natural to work with and can be easily used by business users whose feedback can be fed into the fraud detection system.

Graph visualization through the graph query language helps to analyze a large amount of data and identify patterns that indicate fraudulent activities. In this section, I will share a few scenarios, and the visualizations are based on:

Fraud signal warning

Find all claims filed by the policyholder of fraudulent claim “C4377” and show the diseases of claim “C4377” patient.

Potential fraudulent claim signals, image by author

To deep dive into this policyholder(PH3759), we see that this person saw different doctors at different providers, which is not normal.

Image by author

Policyholder connections related to the fraudulent claim

The graph shows the connections of the high-risk profile that has the fraud claim “C4377”. We see one high-risk policyholder in the 1st-degree connection and another high-risk policyholder in the 3rd-degree connection.

Image by author

Augment machine learning model with graph features

Feature engineering is the art of formulating useful features from existing data. The definition is:

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

Many machine learning applications typically rely on tabular data models and omit the relationship & contextual data, which are the strongest predictors of behavior. This is crucial as each individual claim/provider may look legitimate by avoiding red flags.

In this section, I will walk through a concrete example of how to transform the tabular data into a graph and extract graph features to augment the machine learning model. The overall approach is:

  • Ingest tabular data into graph structure in Python
  • Feature engineering on the graph data
  • Incorporate graph features into the machine learning pipeline


  • we are predicting whether a claim is fraudulent or not based on the provider’s fraudulent flag
  • the idea is to show the process of creating potential predictive features and augment the current machine learning pipeline

Ingest Tabular Data into Graph Structure

In this section, I will use:

The dataset is from Medicare and consists of claims filed by the providers as well as information about the beneficiary for every claim:

  • Patient-related features: age, gender, location, health conditions, etc.
  • Claim-related features: start & end date, claim amount, diagnosis code, procedure code, provider, attending physician, operating physician, etc.

Nodes in this article are providers and attending physicians who are source and target, respectively.

# import package
from igraph import *
# use time-based split to avoid data leakage
Train_Inpatient_G = Train_Inpatient[Train_Inpatient['ClaimStartDt'] < '2009-10-01'][['Provider', 'AttendingPhysician', 'ClaimID']].reset_index(drop = True)
Train_Outpatient_G = Train_Outpatient[Train_Outpatient['ClaimStartDt'] < '2009-10-01'][['Provider', 'AttendingPhysician', 'ClaimID']].reset_index(drop = True)
# merge inpatient and outpatient data
G_df = pd.concat([Train_Inpatient_G, Train_Outpatient_G], ignore_index = True, sort = False)
# implement edge with weight
G_df = G_df[['Provider', 'AttendingPhysician', 'ClaimID']]
G_df = G_df.groupby(['Provider', 'AttendingPhysician']).size().to_frame('Weight').reset_index()
source = 'Provider'
target = 'AttendingPhysician'
weight = 'Weight'
G_df[source] = G_df[source].astype(str)
G_df[target] = G_df[target].astype(str)
# create graph from dataframe
G = Graph.DataFrame(G_df, directed=False)

Feature engineering on graph data

Using graph algorithms, I create new potential meaningful features, such as connectivity metrics and clustering features based on the relationship. In order to prevent target leakage, I use the time-based split to generate graph features. In this section, I will provide code for graph features:


The node can be defined as the number of edges incident to a node.

degree = pd.DataFrame({'Node': G.vs["name"],
'Degree': G.strength(weights =['Weight'])})


In a connected graph, closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph.

closeness = pd.DataFrame({'Node': G.vs["name"],
'Closeness': G.closeness(weights =['Weight'])})


Infomap is a graph clustering algorithm capable of achieving high-quality communities.

communities_infomap = pd.DataFrame({'Node': G.vs["name"],
'communities_infomap': G.community_infomap(edge_weights =['Weight']).membership})

The resulting metrics from the graph algorithms are transformed into a table so that the generated features can be used in a machine learning model.

# merge graph features
graph_feature = [degree, closeness, eigenvector_centrality, pagerank, communities_infomap, community_multilevel]
graph_feature = reduce(lambda left,right: pd.merge(left,right, how = 'left',on='Node'), graph_feature)

Incorporate graph features into the machine learning pipeline

The graph features are merged into the raw model-ready data, and the time-based split is utilized to prepare the data for the machine learning model.

train = Final_Dataset_Train[Final_Dataset_Train['ClaimStartDt'] < '2009-10-01'].reset_index(drop = True).drop('ClaimStartDt', axis = 1)
test = Final_Dataset_Train[Final_Dataset_Train['ClaimStartDt'] >= '2009-10-01'].reset_index(drop = True).drop('ClaimStartDt', axis = 1)
x_tr = train.drop(axis=1,columns=['PotentialFraud'])
y_tr = train['PotentialFraud']
x_val = test.drop(axis=1,columns=['PotentialFraud'])
y_val = test['PotentialFraud']

In this article, I explored 6 scenarios with the combination of 2 algorithms and 3 feature spaces:

  • Algorithm:
    – Logistic Regression
    – Random Forest
  • Feature space:
    – original features
    – original features + node level features
    – original features + node level features + clustering features
lr = LogisticRegression(penalty='none', solver='saga', random_state=42, n_jobs=-1)
rf = RandomForestClassifier(n_estimators=300, max_depth=5, min_samples_leaf=50,
max_features=0.3, random_state=42, n_jobs=-1), y_tr), y_tr)
preds_lr = lr.predict_proba(x_val)[:,1]
preds_rf = rf.predict_proba(x_val)[:,1]

Here I use AUC as the evaluation metric on the test set and the ROC curve is below:

AUC comparison, image by author

Generally, as the plot above, we can summarize:

  • Outperformance by the model with graph features
  • Outperformance by Random Forest with node level and clustering features

In this article, I didn’t dive deep into model interpretation and model diagnosis. Earlier, I wrote an article on model interpretation — Deep Learning Model Interpretation Using SHAP, which you may be interested in.


  • A graph database can be queried, visualized, and analyzed by business users to discover fraud schemes, especially for the fraudulent activities that try to hide through complex network structures, which could be quite a difficult task in a conventional data structure.
  • Incorporating relationship information and adding these potential predictive features into the current machine learning pipeline can improve model performance, especially for scenarios in that multiple parties are involved in fraudulent activities.


Insurance Fraud Detection with Graph Analytics was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓