This project focuses on detecting fraudulent transactions using a dataset of financial transactions. The goal is to build models that accurately classify transactions as fraudulent or non-fraudulent based on several features. A range of classification algorithms are employed, and the performance of the models is evaluated through metrics such as AUC-ROC, precision, and recall.
- Project Overview
- Data Cleaning and Preprocessing
- Feature Engineering
- Modeling
- Results
- Technologies Used
- How to Run
- Contributing
- Contact
- Missing Values: The dataset was checked for any missing values. There were no missing values, so no imputation was required.
- Outlier Detection: Outliers in the `amount` column were detected using the IQR method. Although outliers were identified, they were not removed since they can be crucial in fraud detection.
- Multicollinearity: Multicollinearity was checked using the Variance Inflation Factor (VIF). High collinearity was addressed by creating new features such as `Actual_amount_orig` and `Actual_amount_dest`.
- Encoding: Categorical data such as the `type` column was encoded using label encoding, while features like `nameOrig` and `nameDest` were excluded to avoid leakage.
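The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual code: the rows, the 1.5 × IQR fence (the conventional choice, not confirmed by the project), and the `type_encoded` column name are all assumptions.

```python
import pandas as pd

# Tiny hypothetical sample; only the column names come from the project.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [120.0, 95.0, 110.0, 130.0, 105.0, 25000.0],
    "nameOrig": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "nameDest": ["M1", "C7", "C8", "M2", "C9", "C10"],
})

# 1. Missing values: count nulls per column.
missing_counts = df.isnull().sum()

# 2. Outliers: flag amounts outside the IQR fences (assumed 1.5 * IQR).
#    Rows are only flagged, not dropped, since extreme amounts may be fraud.
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# 3. Encoding: map each transaction type to an integer label
#    (categories are ordered alphabetically), then drop the
#    account-identifier columns and the original text column.
df["type_encoded"] = df["type"].astype("category").cat.codes
features = df.drop(columns=["nameOrig", "nameDest", "type"])

print(missing_counts.sum(), int(is_outlier.sum()))
print(features.head())
```

Pandas category codes are used here to keep the sketch dependency-free; scikit-learn's `LabelEncoder` produces the same kind of integer labels.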
Feature engineering was performed to derive useful attributes from the existing dataset. The new features include:
- Actual_amount_orig: The difference between `oldbalanceOrg` and `newbalanceOrig`.
- Actual_amount_dest: The difference between `oldbalanceDest` and `newbalanceDest`.
These features were designed to enhance the predictive power of the model while reducing multicollinearity.
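To illustrate why the derived features help with multicollinearity, here is a small sketch on synthetic balances. The `vif` helper is a hypothetical, hand-rolled version of the textbook definition (1 / (1 - R²), with an intercept); the project may well have used a library routine instead, and the generated data is invented purely for demonstration.

```python
import numpy as np
import pandas as pd


def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2), where R^2 comes from regressing
    that column on all the other columns plus an intercept."""
    values = {}
    for col in frame.columns:
        y = frame[col].to_numpy(dtype=float)
        others = frame.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(frame)), others])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residual = y - X @ coef
        r2 = 1.0 - residual.var() / y.var()
        values[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(values)


# Synthetic balances: the "new" balance is the old one shifted by a
# comparatively small amount, so old and new are almost collinear.
rng = np.random.default_rng(0)
n = 500
old_org = rng.uniform(1_000, 100_000, n)
sent = rng.uniform(0, 1_000, n)
old_dest = rng.uniform(1_000, 100_000, n)
received = rng.uniform(0, 1_000, n)

df = pd.DataFrame({
    "oldbalanceOrg": old_org,
    "newbalanceOrig": old_org - sent,
    "oldbalanceDest": old_dest,
    "newbalanceDest": old_dest + received,
})

# Derived features, defined as in the list above.
df["Actual_amount_orig"] = df["oldbalanceOrg"] - df["newbalanceOrig"]
df["Actual_amount_dest"] = df["oldbalanceDest"] - df["newbalanceDest"]

balance_cols = ["oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]
vif_raw = vif(df[balance_cols])
vif_new = vif(df[["Actual_amount_orig", "Actual_amount_dest"]])

print(vif_raw.round(1))
print(vif_new.round(3))
```

On data shaped like this, the raw balance columns show very large VIF values while the two difference features stay close to 1, which is the effect the feature design is aiming for.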
Multiple models were built to predict fraud, including:
- Decision Tree Classifier:
  - A simple decision tree was used as a baseline model. The tree model was able to capture relationships between features but showed limitations in performance.
- Random Forest Classifier:
  - A Random Forest classifier was implemented as the primary model. It provided better generalization by combining multiple decision trees and reducing variance.
- Handling Imbalanced Data: The Synthetic Minority Oversampling Technique (SMOTE) was applied to deal with class imbalance, ensuring the model could effectively detect fraud despite it being a minority class.
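For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below, `smote_sketch`, is a hypothetical and deliberately simplified NumPy version written only to show that idea; it is not the project's code. In practice one would typically use the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE` with `fit_resample`), applied to the training split only.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from minority samples X_min.

    Simplified illustration: each synthetic row lies on the line segment
    between a random minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # seed points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Made-up two-dimensional data: 200 majority rows, 10 minority rows.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(200, 2))
X_minor = rng.normal(3.0, 0.5, size=(10, 2))

X_synth = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_synth])
y_balanced = np.concatenate([
    np.zeros(len(X_major), dtype=int),
    np.ones(len(X_minor) + len(X_synth), dtype=int),
])

print(X_balanced.shape, np.bincount(y_balanced))
```

Because every synthetic row is a blend of two real minority rows, the oversampled class stays inside the region the minority class already occupies rather than duplicating rows exactly.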
The results from the models include:
- Decision Tree Classifier:
  - Accuracy: 97.24%
  - AUC-ROC: 0.97
- Random Forest Classifier:
  - Accuracy: 97.33%
  - AUC-ROC: 0.97
Key Features: Feature importance analysis revealed that Actual_amount_orig, Actual_amount_dest, and step were the most important factors in predicting fraudulent transactions.
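The comparison above can be reproduced in outline with scikit-learn. The sketch below runs on synthetic, imbalanced data from `make_classification`, so its scores will not match the figures reported here; the hyperparameters (`max_depth=5`, `n_estimators=100`) and the 70/30 split are assumptions, not the notebook's settings. Precision and recall are included alongside accuracy because accuracy alone can look high on imbalanced data even when few frauds are caught.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transaction data: about 5% positive class.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auc_roc": roc_auc_score(y_test, proba),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }

# Impurity-based feature importances from the fitted forest (they sum to 1).
importances = models["Random Forest"].feature_importances_

for name, metrics in scores.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print(importances.round(3))
```

With the real dataset, pairing `importances` with the feature column names gives the kind of ranking summarized under Key Features.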
- Python: For data manipulation and modeling.
- Pandas: For data cleaning and feature engineering.
- Scikit-learn and imbalanced-learn: For model building, evaluation, and SMOTE oversampling (SMOTE is provided by imbalanced-learn rather than scikit-learn itself).
- Matplotlib & Seaborn: For data visualization and model performance analysis.
- Jupyter Notebook: For running the analysis and visualizing results.
- Clone the repository:

      git clone https://github.com/AmaanP314/fraud-detection-system.git
      cd fraud-detection-system

- Install the required dependencies:

      pip install -r requirements.txt

- Run the Jupyter notebook to perform the analysis:

      jupyter notebook fraud_detection.ipynb

Contributions are welcome! Feel free to fork the repository and submit a pull request with any suggestions or improvements.
Amaan Poonawala - GitHub | LinkedIn
Feel free to reach out for any questions or feedback.