Data Wrangling Of Fraudulent Credit Cards
According to Wikipedia, credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services, or to make a payment to another account that is controlled by a criminal.
In this article, I will perform data wrangling on a credit card dataset obtained from Kaggle.
Data Wrangling
Data wrangling is one of the most important first steps in data analysis. It is the process of converting or mapping data from its initial “raw” form into another format in order to prepare it for further analysis.
The Dataset
The data contains transactions made by credit cards in September 2013 by European cardholders. The transactions occurred over two days and include both fraudulent and non-fraudulent ones.
I will start the analysis by importing the necessary libraries:
import pandas as pd
import numpy as np
The dataset is stored in CSV (comma-separated values) format, so I will use the pandas “read_csv” function to load it:
credit_card = pd.read_csv("creditcard.csv")
Now we can view the dataset by displaying it:
credit_card
The displayed output also reports the number of rows (284807) and columns (31).
To view the first 10 and last 10 rows of the data, we will use the pandas head and tail methods:
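The row and column counts can also be read directly from the shape attribute instead of the printed output. Below is a minimal sketch; the three-row CSV text and the name `toy` are made up for illustration and stand in for creditcard.csv.

```python
import io

import pandas as pd

# A tiny stand-in for creditcard.csv (invented values, only four columns).
csv_text = """Time,V1,Amount,Class
0,-1.36,149.62,0
1,1.19,2.69,0
406,-2.31,0.00,1
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

On the real data, `credit_card.shape` would be used in the same way.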
credit_card.head(10)
credit_card.tail(10)
To see the column names, data types, and the number of non-null values in each column, we will use the info method:
credit_card.info()
The output shows that every column has as many non-null values as there are rows, so the dataset has no missing values and there is no need to drop rows or columns. The columns are also already of integer and float types, so no type casting is needed.
To get a statistical overview of the data, we will use the pandas describe method:
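The missing-value check can also be done explicitly rather than by reading the info output. The sketch below assumes a small made-up frame called `toy` (with one deliberately missing amount) in place of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit card data; the values are invented.
toy = pd.DataFrame(
    {
        "Time": [0.0, 1.0, 2.0, 3.0],
        "Amount": [149.62, 2.69, np.nan, 123.50],
        "Class": [0, 0, 1, 0],
    }
)

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)

# Data type of each column, to decide whether any casting is needed.
print(toy.dtypes)
```

Applied to the real data, `credit_card.isna().sum()` should agree with what the info output shows.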
credit_card.describe()
Since this dataset contains both fraudulent and non-fraudulent transactions, I would like to separate them. In the Class column, 1 marks a fraudulent transaction and 0 a normal one:
filt_fraudulent = credit_card["Class"] == 1
filt_normal = credit_card["Class"] == 0
To view only the fraudulent transactions, we pass the boolean filter to loc:
fraudulent = credit_card.loc[filt_fraudulent]
fraudulent
To get the number of fraudulent transactions, we can use the shape attribute:
fraudulent.shape
This shows that we have 492 fraudulent transactions.
To view only the non-fraudulent transactions, we pass the other filter to loc in the same way:
normal = credit_card.loc[filt_normal]
normal
To get the number of non-fraudulent (normal) transactions, we again use the shape attribute:
normal.shape
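As an alternative to filtering twice and reading shape, both class counts can be obtained in one call. This is a minimal sketch on a made-up frame named `toy`; only the column name Class is taken from the dataset.

```python
import pandas as pd

# Toy stand-in: six transactions, two of them marked as fraud (invented values).
toy = pd.DataFrame(
    {
        "Amount": [149.62, 2.69, 378.66, 123.50, 1.00, 69.99],
        "Class": [0, 0, 1, 0, 1, 0],
    }
)

# Number of rows for each value of Class.
class_counts = toy["Class"].value_counts()
print(class_counts)

# Counting True values in the boolean filter gives the same fraud count.
n_fraud = int((toy["Class"] == 1).sum())

# Because Class is 0 or 1, its mean is the fraction of fraudulent rows.
fraud_share = toy["Class"].mean()
print(n_fraud, fraud_share)
```

With the figures reported above, 492 fraudulent transactions out of 284807 is roughly 0.17% of the data, so the two classes are heavily imbalanced.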
Lastly, I want to sort the dataset by the amount transacted:
credit_card.sort_values(by="Amount", inplace=True)
credit_card
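Sorting with inplace=True modifies credit_card itself. If the original row order should be kept, a sorted copy works just as well. The sketch below uses a made-up frame named `toy` and also shows nlargest as a shortcut for the biggest amounts.

```python
import pandas as pd

# Toy stand-in with invented amounts.
toy = pd.DataFrame(
    {
        "Amount": [149.62, 2.69, 378.66, 0.0, 69.99],
        "Class": [0, 0, 1, 1, 0],
    }
)

# Without inplace=True, sort_values returns a new sorted frame
# and leaves the original untouched.
sorted_desc = toy.sort_values(by="Amount", ascending=False).reset_index(drop=True)
print(sorted_desc)

# nlargest returns the n rows with the largest values, already sorted.
top3 = toy.nlargest(3, "Amount")
print(top3)
```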
Data Visualization
After performing some basic analysis on the dataset, it is a good idea to visualize it to gain more insight. Here, I will use the matplotlib library in Python.
1. Using a bar plot to view the fraudulent and normal transactions:
2. Using a histogram to view the amount per transaction for the fraudulent cards and normal cards:
Here, we notice that the amounts in the fraudulent transactions are relatively small compared to those in the normal transactions.
3. Using a histogram to view the top 10 fraudulent transactions based on amount:
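The three plots listed above could be drawn roughly as follows. This is only a sketch, not the code from the repository: the frame `toy` is synthetic (random amounts, with the first 25 rows arbitrarily marked as fraud), so the shapes it produces say nothing about the real data. For the third panel I draw the ten largest fraudulent amounts as bars, which is a substitution on my part; `hist` can be swapped in if a histogram is preferred.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for creditcard.csv: 500 rows, the first 25 marked as fraud.
# The amounts are random and invented for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Amount": rng.exponential(scale=80.0, size=500).round(2),
        "Class": [1] * 25 + [0] * 475,
    }
)
fraudulent = toy.loc[toy["Class"] == 1]
normal = toy.loc[toy["Class"] == 0]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Bar plot: number of transactions in each class.
counts = toy["Class"].value_counts().sort_index()  # index 0 = normal, 1 = fraud
axes[0].bar(["Normal", "Fraudulent"], counts.to_numpy())
axes[0].set_ylabel("Number of transactions")
axes[0].set_title("Transactions per class")

# 2. Histograms: amount per transaction for each class, on shared bin edges.
bins = np.linspace(0, toy["Amount"].max(), 31)
n_normal, _, _ = axes[1].hist(normal["Amount"], bins=bins, alpha=0.6, label="Normal")
n_fraud, _, _ = axes[1].hist(fraudulent["Amount"], bins=bins, alpha=0.6, label="Fraudulent")
axes[1].set_xlabel("Amount")
axes[1].set_ylabel("Number of transactions")
axes[1].set_title("Amount per transaction")
axes[1].legend()

# 3. Top 10 fraudulent transactions by amount, largest first.
top10 = fraudulent.nlargest(10, "Amount")
ranks = list(range(1, len(top10) + 1))
axes[2].bar(ranks, top10["Amount"].to_numpy())
axes[2].set_xticks(ranks)
axes[2].set_xlabel("Rank")
axes[2].set_ylabel("Amount")
axes[2].set_title("Top 10 fraudulent amounts")

fig.tight_layout()
# In a notebook the figure renders on its own; in a script,
# call plt.show() or fig.savefig(...) to see it.
```

On the real dataset the normal class is far larger than the fraudulent one, so a logarithmic y-axis on the second panel may make the fraudulent bars easier to see.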
To get the full code for this analysis, you can check my GitHub account through the link below.