Insurance Fraud Detection with Graph Analytics
Last Updated on July 31, 2022 by Editorial Team
Author(s): Tony Zhang
Explore graph databases and extract graph features to augment the machine learning model
Insurance fraud is an enormous problem, and the insurance industry has been fighting it for a very long time.
Health insurance fraud is a type of white-collar crime in which dishonest claims are filed for profit. The fraudulent activity may involve providers, physicians, and beneficiaries acting together to file fraudulent claims. Given the complexity of the parties involved, such activity is hard to detect. The anomalous behavior could be a fraud ring in which providers, doctors, and clients file small claims over time. These small claims are difficult to catch with traditional tools, which look at claims individually and cannot uncover the connections between the various players.
A graph database is designed to fill this gap by storing and handling highly connected data, where the relationships are as important as the individual data points. Hence, we can use graph analytics to understand those relationships. In this article, you will learn:
- Graph basics
- Graph search and query to understand the relationships
- Augmenting the machine learning model with graph features
Graph basics
What is a graph?
In mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called vertices (also called nodes or points), and each of the related pairs of vertices is called an edge (also called a link or line).
Simply put, a graph is a mathematical representation of any type of network with:
- Vertex, sometimes called a node, which in the insurance context can be:
– Claim
– Policyholder
- Edge, the relationship/interaction/behavior between the nodes:
– Policyholder of claim
– Insured of claim
Graph algorithm concepts will be introduced in the third chapter, Augment machine learning model with graph features.
Graph search and query
A graph database is purpose-built to store and navigate relationships by exploiting the connectivity between vertices. End users don't need to execute countless joins, because a graph query language is about pattern matching between nodes. This is more natural to work with and can easily be used by business users, whose feedback can be fed back into the fraud detection system.
Graph visualization through the graph query language helps to analyze large amounts of data and identify patterns that indicate fraudulent activity. In this section, I will share a few scenarios; the visualizations are based on:
- Source of data: Analyzing insurance claims using IBM Db2 Graph
- Graph database: Nebula Graph
Fraud signal warning
Find all claims filed by the policyholder of the fraudulent claim "C4377", and show the diseases of the patient on claim "C4377".
Diving deeper into this policyholder (PH3759), we see that this person saw different doctors at different providers, which is not normal.
Policyholder connections related to the fraudulent claim
The graph shows the connections of the high-risk profile behind the fraudulent claim "C4377". We see one high-risk policyholder in the 1st-degree connections and another high-risk policyholder in the 3rd-degree connections.
Augment machine learning model with graph features
Feature engineering is the art of formulating useful features from existing data. The definition is:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
Many machine learning applications rely on tabular data models and omit relationship and contextual data, which are among the strongest predictors of behavior. This matters because each individual claim or provider may look legitimate on its own by avoiding red flags.
In this section, I will walk through a concrete example of how to transform tabular data into a graph and extract graph features to augment the machine learning model. The overall approach is:
- Ingest tabular data into a graph structure in Python
- Feature engineering on the graph data
- Incorporate graph features into the machine learning pipeline
Caveats:
- We are predicting whether a claim is fraudulent based on the provider's fraudulent flag
- The idea is to show the process of creating potentially predictive features and augmenting the current machine learning pipeline
Ingest Tabular Data into Graph Structure
In this section, I will use:
- Source of data: HEALTHCARE PROVIDER FRAUD DETECTION ANALYSIS
- Python graph package: igraph
The dataset is from Medicare and consists of claims filed by the providers as well as information about the beneficiary for every claim:
- Patient-related features: age, gender, location, health conditions, etc.
- Claim-related features: start & end date, claim amount, diagnosis code, procedure code, provider, attending physician, operating physician, etc.
The nodes in this article are providers and attending physicians, which serve as the source and target, respectively.
# import packages
import pandas as pd
from igraph import *
# use time-based split to avoid data leakage
Train_Inpatient_G = Train_Inpatient[Train_Inpatient['ClaimStartDt'] < '2009-10-01'][['Provider', 'AttendingPhysician', 'ClaimID']].reset_index(drop = True)
Train_Outpatient_G = Train_Outpatient[Train_Outpatient['ClaimStartDt'] < '2009-10-01'][['Provider', 'AttendingPhysician', 'ClaimID']].reset_index(drop = True)
# merge inpatient and outpatient data
G_df = pd.concat([Train_Inpatient_G, Train_Outpatient_G], ignore_index = True, sort = False)
G_df.shape
# implement edge with weight
G_df = G_df[['Provider', 'AttendingPhysician', 'ClaimID']]
G_df = G_df.groupby(['Provider', 'AttendingPhysician']).size().to_frame('Weight').reset_index()
G_df.shape
source = 'Provider'
target = 'AttendingPhysician'
weight = 'Weight'
G_df[source] = G_df[source].astype(str)
G_df[target] = G_df[target].astype(str)
G_df.shape
# create graph from dataframe
G = Graph.DataFrame(G_df, directed=False)
Feature engineering on graph data
Using graph algorithms, I create potentially meaningful new features, such as connectivity metrics and clustering features based on the relationships. To prevent target leakage, I use the time-based split when generating the graph features. In this section, I will provide code for a few of the graph features:
Degree
The degree of a node is defined as the number of edges incident to that node. Here, the weighted degree (strength) is computed, summing the edge weights rather than counting edges.
degree = pd.DataFrame({'Node': G.vs["name"],
'Degree': G.strength(weights = G.es['Weight'])})
degree.shape
Closeness
In a connected graph, the closeness centrality (or closeness) of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph.
closeness = pd.DataFrame({'Node': G.vs["name"],
'Closeness': G.closeness(weights = G.es['Weight'])})
closeness.shape
Infomap
Infomap is a graph clustering algorithm capable of finding high-quality communities.
communities_infomap = pd.DataFrame({'Node': G.vs["name"],
'communities_infomap': G.community_infomap(edge_weights = G.es['Weight']).membership})
communities_infomap.shape
The resulting metrics from the graph algorithms are transformed into a table so that the generated features can be used in a machine learning model.
# merge graph features (eigenvector_centrality, pagerank, and
# community_multilevel are generated analogously to the features above)
from functools import reduce
graph_feature = [degree, closeness, eigenvector_centrality, pagerank, communities_infomap, community_multilevel]
graph_feature = reduce(lambda left, right: pd.merge(left, right, how='left', on='Node'), graph_feature)
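The reduce-based merge can be checked on toy frames; the column names below mirror the article, while the values are invented:

```python
# Toy demonstration of chaining left-merges with functools.reduce.
from functools import reduce
import pandas as pd

degree = pd.DataFrame({"Node": ["P1", "P2"], "Degree": [7, 2]})
closeness = pd.DataFrame({"Node": ["P1", "P2"], "Closeness": [0.8, 0.3]})
cluster = pd.DataFrame({"Node": ["P1", "P2"], "communities_infomap": [0, 1]})

frames = [degree, closeness, cluster]
graph_feature = reduce(
    lambda left, right: pd.merge(left, right, how="left", on="Node"),
    frames,
)
print(graph_feature.shape)  # (2, 4): one row per node, one column per feature
```

Each step left-merges the next feature table on `Node`, so the result keeps exactly one row per node with all feature columns side by side.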
Incorporate graph features into the machine learning pipeline
The graph features are merged into the raw model-ready data, and the time-based split is utilized to prepare the data for the machine learning model.
train = Final_Dataset_Train[Final_Dataset_Train['ClaimStartDt'] < '2009-10-01'].reset_index(drop = True).drop('ClaimStartDt', axis = 1)
print(train.shape)
test = Final_Dataset_Train[Final_Dataset_Train['ClaimStartDt'] >= '2009-10-01'].reset_index(drop = True).drop('ClaimStartDt', axis = 1)
print(test.shape)
x_tr = train.drop(axis=1,columns=['PotentialFraud'])
y_tr = train['PotentialFraud']
x_val = test.drop(axis=1,columns=['PotentialFraud'])
y_val = test['PotentialFraud']
In this article, I explored 6 scenarios combining 2 algorithms and 3 feature spaces:
- Algorithm:
– Logistic Regression
– Random Forest
- Feature space:
– original features
– original features + node-level features
– original features + node-level features + clustering features
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(penalty='none', solver='saga', random_state=42, n_jobs=-1)
rf = RandomForestClassifier(n_estimators=300, max_depth=5, min_samples_leaf=50,
                            max_features=0.3, random_state=42, n_jobs=-1)
lr.fit(x_tr, y_tr)
rf.fit(x_tr, y_tr)
preds_lr = lr.predict_proba(x_val)[:,1]
preds_rf = rf.predict_proba(x_val)[:,1]
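The AUC itself can be computed from the predicted probabilities with scikit-learn's `roc_auc_score`; in the pipeline this would be `roc_auc_score(y_val, preds_lr)` and `roc_auc_score(y_val, preds_rf)`. A self-contained toy with invented labels and scores:

```python
# Toy AUC computation with scikit-learn's roc_auc_score.
from sklearn.metrics import roc_auc_score

y_val = [0, 0, 1, 1]
preds = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_val, preds)
print(auc)  # 0.75
```

AUC is threshold-free: it measures how often a randomly chosen fraudulent claim is scored higher than a randomly chosen legitimate one.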
Here, I use AUC as the evaluation metric on the test set, and the ROC curve is below:
Generally, from the plot above, we can summarize:
- The models with graph features outperform those without
- Random Forest with node-level and clustering features performs best
In this article, I didn't dive deep into model interpretation and model diagnosis. Earlier, I wrote an article on model interpretation, Deep Learning Model Interpretation Using SHAP, which you may find interesting.
Conclusion
- A graph database can be queried, visualized, and analyzed by business users to discover fraud schemes, especially fraudulent activities that try to hide behind complex network structures, which would be quite difficult to uncover with a conventional data structure.
- Incorporating relationship information and adding these potentially predictive features into the current machine learning pipeline can improve model performance, especially for scenarios in which multiple parties are involved in fraudulent activities.
References
https://www.iii.org/article/background-on-insurance-fraud
https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)
https://github.com/IBM/analyzing-insurance-claims-using-ibm-db2-graph
https://github.com/vesoft-inc/nebula
https://www.kaggle.com/datasets/rohitrox/healthcare-provider-fraud-detection-analysis
https://en.wikipedia.org/wiki/Degree_(graph_theory)
https://en.wikipedia.org/wiki/Closeness_centrality
https://towardsdatascience.com/deep-learning-model-interpretation-using-shap-a21786e91d16