A comprehensive end-to-end machine learning pipeline for credit card fraud detection with high accuracy (~99.6%). This project includes the complete workflow from data exploration to model deployment with an integrated CI/CD pipeline.
This project implements a machine learning system to detect fraudulent credit card transactions. It uses a custom-built fraud detection model trained on transaction data that includes temporal patterns, merchant information, and customer demographics.
Key features of this pipeline:
- Data preprocessing with custom transformations for categorical and temporal features
- Exploratory data analysis revealing insights into fraud patterns
- Model training and evaluation with high performance metrics
- Complete CI pipeline for automated testing and deployment
- Comprehensive logging system for tracking model performance and issues
- Visualization tools for understanding model decisions and fraud patterns
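As a rough illustration of the preprocessing step listed above, the sketch below derives temporal features from a timestamp and one-hot encodes a category. It is not the project's actual code: the field names (`trans_time`, `category`), the category list, and the 22:00 to 06:00 "night" window are all assumptions made for the example, and only the standard library is used.

```python
from datetime import datetime


def temporal_features(timestamp: str) -> dict:
    """Derive hour-of-day, day-of-week, and a night flag from a timestamp string."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),  # Monday = 0, Sunday = 6
        # Assumed definition of "night": 22:00 up to (not including) 06:00
        "is_night": int(dt.hour >= 22 or dt.hour < 6),
    }


def one_hot(value: str, categories: list[str]) -> dict:
    """One-hot encode a categorical value against a fixed list of categories."""
    return {f"category_{c}": int(value == c) for c in categories}


# Hypothetical transaction record; field names are assumptions
transaction = {"trans_time": "2020-06-21 03:14:08", "category": "grocery"}

features = {
    **temporal_features(transaction["trans_time"]),
    **one_hot(transaction["category"], ["grocery", "shopping", "travel"]),
}
print(features)
```

Keeping the category list fixed means that training and inference produce the same feature columns, even when a category is absent from a given batch.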
Our exploratory data analysis revealed several important patterns:
- Age-Based Vulnerability: People over 50 tend to be more vulnerable to fraud than younger age groups.
- Temporal Patterns: Fraud rates vary significantly by hour of day and day of week, with late-night and early-morning hours showing higher risk.
- Geographic Hotspots: Large cities like Washington, New York, and Los Angeles have the highest number of fraudulent transactions.
- Transaction Categories: Shopping and grocery transactions show higher fraud rates than other categories.
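Patterns like these typically come from comparing fraud rates across groups. The sketch below shows one way such a comparison could be computed, using only the standard library. The records are toy values invented for illustration, not rows from the dataset, and the field names (`hour`, `age`, `is_fraud`) are assumptions.

```python
from collections import defaultdict


def fraud_rate_by(records, key):
    """Return {group: fraction of fraudulent records}, grouping with the key function."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        frauds[group] += record["is_fraud"]
    return {group: frauds[group] / totals[group] for group in totals}


# Toy records for illustration only; not taken from the dataset
records = [
    {"hour": 2, "age": 63, "is_fraud": 1},
    {"hour": 2, "age": 34, "is_fraud": 0},
    {"hour": 14, "age": 55, "is_fraud": 0},
    {"hour": 14, "age": 29, "is_fraud": 0},
]

by_hour = fraud_rate_by(records, lambda r: r["hour"])
by_age = fraud_rate_by(
    records, lambda r: "over_50" if r["age"] > 50 else "50_or_under"
)
print(by_hour)
print(by_age)
```

The same helper works for any grouping (city, transaction category, day of week) by swapping the key function.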
- Clone the Repository:
    git clone https://github.com/ol1g3/fraud-detection-ML-pipeline.git
    cd fraud-detection-ML-pipeline
- Set Up the Virtual Environment: Choose one of the methods below.
- Create the Virtual Environment (Mac):
    python3 -m venv .venv
- Activate the Virtual Environment and Install Requirements:
    source .venv/bin/activate
    pip install -r requirements.txt
- Deactivate the Virtual Environment (when done):
    deactivate
- Create the Virtual Environment (using uv):
    uv venv --python 3.11
- Activate the Virtual Environment and Install Requirements:
    source .venv/bin/activate
    uv pip install -r requirements.txt
- Deactivate the Virtual Environment (when done):
    deactivate
This project includes a complete CI/CT pipeline that:
- Runs automated tests on every commit, including a build check and unit tests
- Validates model performance on test data (regression test)
- Generates performance reports and logs
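A model-performance regression test of the kind listed above can be as simple as failing the build when a metric drops below a fixed floor. The following is a minimal sketch under assumptions, not the repository's actual test: the function names and the threshold value are hypothetical, and the labels are toy values.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def check_no_regression(y_true, y_pred, min_accuracy=0.99):
    """Raise AssertionError if accuracy falls below the floor; otherwise return it.

    The default floor is a hypothetical value, not one taken from the project.
    """
    acc = accuracy(y_true, y_pred)
    if acc < min_accuracy:
        raise AssertionError(
            f"accuracy {acc:.4f} is below the required {min_accuracy:.4f}"
        )
    return acc


# Toy labels for illustration only (1 = fraud, 0 = legitimate)
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1]
acc = check_no_regression(y_true, y_pred, min_accuracy=0.8)
print(f"accuracy: {acc:.2f}")
```

Because fraud data is usually heavily imbalanced, a similar floor on recall or precision for the fraud class may be more informative than accuracy alone.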
Possible future improvements:
- Implement neural network-based approaches for potentially higher accuracy
- Add more sophisticated feature engineering based on domain knowledge
The dataset used for this project can be found on Kaggle: Fraud Detection Dataset.