Credit Card Fraud Detection
This project demonstrates anomaly detection with unsupervised learning. Given credit card transaction data, the models flag whether a transaction is fraudulent. The data is from Kaggle.
- Required Packages
- Version Check
- Dataset Load
- Exploratory Data Analysis
- Preprocess Dataset
- Build the model
import sys
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sklearn
plt.rcParams['figure.figsize'] = (8, 8)
print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(np.__version__))
print('Matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
print('Pandas: {}'.format(pd.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
Dataset Load
More information about the dataset is available on its Kaggle page.
data = pd.read_csv('./dataset/creditcard.csv')
data.head()
data.describe()
print(data.columns)
print(data.shape)
data.hist(figsize=(20, 20));
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
# Share of fraud cases in the whole dataset; used later as the contamination estimate
outlier_fraction = len(fraud) / len(data)
print(outlier_fraction)
print('Fraud Cases: {}'.format(len(fraud)))
print('Valid Cases: {}'.format(len(valid)))
The EDA shows that the dataset is highly imbalanced: fraud cases make up only a very small share of all transactions.
corrmat = data.corr()
fig = plt.figure(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True);
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ['Class']]
# Store the variable we want to predict
target = 'Class'
X = data[columns]
y = data[target]
# Print the shape
print(X.shape)
print(y.shape)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
According to the sklearn documentation, LocalOutlierFactor performs unsupervised outlier detection using the Local Outlier Factor (LOF).
The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
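To make the definition above concrete, here is a minimal from-scratch sketch of the LOF computation in plain Python. It is not the sklearn implementation: the function name `local_outlier_factors`, the toy points, and `k=3` are all made up for illustration, and ties in neighbor distance are broken arbitrarily. Note that sklearn's `negative_outlier_factor_` stores the opposite of this score, so there the most negative values are the strongest outliers.

```python
import math

def local_outlier_factors(points, k):
    """Return the LOF score of every point (values well above 1 suggest an outlier)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical toy data: a tight cluster plus one isolated point (index 5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = local_outlier_factors(points, k=3)
outlier_index = max(range(len(points)), key=lambda i: scores[i])
print(outlier_index, [round(s, 2) for s in scores])
```

The isolated point has a much lower local density than its neighbors, so its score is far above 1, while the clustered points stay close to 1.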
And IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
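The path-length idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the sklearn algorithm: it skips subsampling and score normalization, and it only follows the branch that contains the point of interest instead of building whole trees. The helper names, the synthetic data, and the parameters `n_trees` and `max_depth` are assumptions chosen for the example.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:  # this feature cannot split the remaining rows
        return depth
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains the point of interest.
    if point[feature] < split:
        branch = [row for row in data if row[feature] < split]
    else:
        branch = [row for row in data if row[feature] >= split]
    return isolation_depth(point, branch, rng, depth + 1, max_depth)

def mean_isolation_depth(point, data, n_trees=200, seed=1):
    """Average the isolation depth over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: a Gaussian cluster around the origin plus one distant point.
data_rng = random.Random(0)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
anomaly = (8.0, 8.0)
data = cluster + [anomaly]
typical = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # most central point

anomaly_depth = mean_isolation_depth(anomaly, data)
typical_depth = mean_isolation_depth(typical, data)
print(anomaly_depth, typical_depth)
```

The distant point is usually cut off after only a few splits, whereas the central point needs many more, which is the behavior the forest turns into an anomaly score.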
random_state = 1
# Define the outlier detection methods
classifiers = {
    'Isolation Forest': IsolationForest(max_samples=len(X), contamination=outlier_fraction,
                                        random_state=random_state),
    'Local Outlier Factor': LocalOutlierFactor(n_neighbors=20, contamination=outlier_fraction)
}
n_outliers = len(fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # Fit the model and tag outliers
    if clf_name == 'Local Outlier Factor':
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        y_pred = clf.predict(X)
        scores_pred = clf.decision_function(X)
    # Map the predictions to the dataset's labels: 0 for valid, 1 for fraud
    # Inliers are predicted as 1
    y_pred[y_pred == 1] = 0
    # Outliers are predicted as -1
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != y).sum()
    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(y, y_pred))
    print(classification_report(y, y_pred))
Recall that, with tp, tn, fp, and fn denoting the numbers of true positives, true negatives, false positives, and false negatives:
- Accuracy: $$ \dfrac{tp + tn}{tp + tn + fp + fn} $$
- Precision (Positive Predictive Value): $$ \dfrac{tp}{tp + fp} $$
- Recall (Sensitivity, hit rate, True Positive Rate): $$ \dfrac{tp}{tp + fn} $$
- F1 score (harmonic mean of precision and recall): $$ 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
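The formulas above translate directly into code. The sketch below computes them from confusion-matrix counts; the function name `classification_metrics` and the counts are hypothetical and are not results from this dataset. Returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, tn=9940, fp=10, fn=20)
for name, value in metrics.items():
    print('{}: {:.3f}'.format(name, value))
```

With these counts, accuracy comes out at 0.997 while precision is 0.75 and recall is 0.6, so the four metrics can tell quite different stories about the same predictions.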
Because of the data imbalance, the accuracy of both models is very high. Comparing the precision and recall scores, however, shows that Isolation Forest is a better model than Local Outlier Factor.
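A small sketch shows why accuracy alone says little on imbalanced data. The labels below are invented for illustration (10,000 transactions, 20 of them fraudulent) and do not come from the Kaggle dataset; the "model" is a trivial baseline that never flags fraud.

```python
# Hypothetical labels: 20 fraudulent (1) and 9,980 valid (0) transactions.
y_true = [1] * 20 + [0] * 9980
y_baseline = [0] * len(y_true)  # a baseline that predicts "valid" for everything

accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_baseline))
recall = true_positives / sum(y_true)

print(accuracy, recall)  # prints 0.998 0.0
```

The baseline reaches 99.8% accuracy without catching a single fraud case, which is why precision and recall on the fraud class are the more informative numbers here.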