
Anomaly Detection

This project demonstrates an end-to-end Machine Learning solution for detecting fraudulent credit card transactions. It encompasses data preprocessing, model training and optimization, and deployment as a containerized RESTful API. The goal is to identify anomalous transactions that might indicate fraud, leveraging a real-world imbalanced dataset.


Project Brief

This project detects fraudulent credit card transactions using an end-to-end ML workflow: preprocessing, model training/optimization, and deployment as a containerized REST API. It targets real-world class imbalance and optimizes the precision/recall trade-off for the fraud class.

Project Duration (Estimate)

Part-time (evenings/weekends): 6–8 weeks for the MVP (data → model → API → Docker), plus 2–4 weeks for monitoring, a retraining workflow, and CI hardening

Repositories

Approach

  • Planning & Problem Framing
    • Defined the goal: detect fraudulent transactions with high recall while keeping precision practical for review teams
    • Identified constraints: severe class imbalance, limited interpretability, need for real-time inference
  • Data Preparation
    • Loaded the public credit-card dataset; separated train/validation/test splits
    • Scaled key features (Amount, Time) and preserved the anonymized V1–V28 components as-is
    • Applied stratified splits to maintain class ratios across sets
  • Modeling
    • Started with baseline (Logistic Regression) → moved to RandomForest for non-linear boundaries
    • Handled imbalance with class_weight and careful cross-validation
    • Tracked metrics beyond accuracy: ROC-AUC, PR-AUC, Precision/Recall/F1 on the fraud class
  • Threshold Tuning
    • Optimized the decision threshold for the fraud class (maximize F1 while guarding precision)
    • Validated the chosen threshold on a hold-out set to avoid optimistic bias
  • API & Contracts
    • Exported the trained model + scalers with joblib
    • Designed FastAPI schemas (Pydantic) for single/batch prediction with strict validation
    • Exposed `/predict` and documented with Swagger UI & ReDoc
  • Packaging & Deployment
    • Containerized the service with Docker for reproducible local and cloud runs
    • Environment-driven config for thresholds, model paths, and log levels
  • Testing & QA
    • Smoke tests for API routes and schema errors (invalid/missing fields)
    • Metric checks so performance degradation doesn’t slip through (spot-checks of F1/precision/recall)
  • Monitoring & Next Steps
    • Baseline logging for predictions and errors; plan for drift checks on score distributions
    • Future: model retraining pipeline, alerting on metric drops, feature importance reports
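The data-preparation step above can be sketched as follows. The frame here is synthetic stand-in data (the real public dataset's columns are `Time`, `Amount`, `V1`–`V28`, and a binary `Class` label), and the split sizes are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit-card dataset (~2% "fraud" for illustration).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame(rng.normal(size=(n, 28)), columns=[f"V{i}" for i in range(1, 29)])
df["Time"] = rng.uniform(0, 172_800, n)
df["Amount"] = rng.exponential(88.0, n)
df["Class"] = (rng.random(n) < 0.02).astype(int)

X, y = df.drop(columns="Class"), df["Class"]

# Stratified splits keep the fraud ratio stable across train/val/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Scale only Time and Amount (fit on train only); the anonymized
# PCA components V1-V28 are preserved as-is.
scaler = StandardScaler().fit(X_train[["Time", "Amount"]])
for part in (X_train, X_val, X_test):
    part[["Time", "Amount"]] = scaler.transform(part[["Time", "Amount"]])
```

Fitting the scaler on the training split alone avoids leaking validation/test statistics into preprocessing.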
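The modeling step (class-weighted baseline, then a RandomForest, scored with imbalance-aware metrics) can be sketched like this. The data is synthetic and the hyperparameters are illustrative, not the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~2% positives, mimicking fraud rates.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200,
                                     class_weight="balanced", random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    # Accuracy is misleading at 2% positives; track ROC-AUC and PR-AUC instead.
    print(f"{name}: ROC-AUC={roc_auc_score(y_te, scores):.3f} "
          f"PR-AUC={average_precision_score(y_te, scores):.3f}")
```

`class_weight="balanced"` reweights the loss by inverse class frequency, a common first line of defense before resampling techniques.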
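The threshold-tuning step (maximize F1 while guarding precision) can be sketched as a sweep over the precision-recall curve; the 0.5 precision floor and the toy score distributions are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def tune_threshold(y_true, scores, min_precision=0.5):
    """Pick the threshold maximizing F1 among points meeting a precision floor."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall carry one trailing element with no matching threshold.
    precision, recall = precision[:-1], recall[:-1]
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    ok = precision >= min_precision
    if not ok.any():                     # fall back to the best F1 overall
        ok = np.ones_like(ok, dtype=bool)
    best = np.argmax(np.where(ok, f1, -1.0))
    return float(thresholds[best])

# Toy scores: frauds cluster near 1.0, legit near 0.0, with overlap.
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(950, int), np.ones(50, int)])
s = np.concatenate([rng.beta(2, 8, 950), rng.beta(8, 2, 50)])
t = tune_threshold(y, s)
```

As noted in the approach, the chosen threshold should then be validated on a hold-out set, since tuning it on the same data inflates the apparent F1.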
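Exporting and reloading the model and scaler with joblib, as in the API & Contracts step, can look like this minimal sketch (toy model; file names are illustrative):

```python
import pathlib
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy artifacts standing in for the trained fraud model and its scaler.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

out = pathlib.Path(tempfile.mkdtemp())
joblib.dump(model, out / "model.joblib")
joblib.dump(scaler, out / "scaler.joblib")

# At API startup the artifacts are loaded once and reused for every request.
model2 = joblib.load(out / "model.joblib")
scaler2 = joblib.load(out / "scaler.joblib")
```

Persisting the scaler alongside the model matters: requests must pass through exactly the preprocessing the model was trained on.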
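The request schemas could resemble the following pydantic sketch; the field names and shapes here are assumptions, not the project's actual contract. FastAPI turns the `ValidationError` raised on a bad payload into a structured 422 response:

```python
from pydantic import BaseModel, ValidationError

class Transaction(BaseModel):
    time: float
    amount: float
    features: list[float]  # the anonymized V1-V28 components

class BatchRequest(BaseModel):
    transactions: list[Transaction]

# Valid payloads parse into typed objects.
tx = Transaction(time=0.0, amount=42.5, features=[0.0] * 28)

# Missing or mistyped fields are rejected before any model code runs.
try:
    Transaction(time=0.0, features=[])  # amount missing
except ValidationError:
    print("rejected invalid payload")
```

Strict validation at the boundary keeps malformed input out of the scoring path, so the model only ever sees well-formed feature vectors.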
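Environment-driven configuration from the packaging step can be a small stdlib helper; the variable names `FRAUD_THRESHOLD`, `MODEL_PATH`, and `LOG_LEVEL` are assumptions for illustration:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    threshold: float
    model_path: str
    log_level: str

def load_settings(env=os.environ) -> Settings:
    """Read service configuration from environment variables with defaults."""
    return Settings(
        threshold=float(env.get("FRAUD_THRESHOLD", "0.5")),
        model_path=env.get("MODEL_PATH", "model.joblib"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )

# In Docker these come from `-e`/`--env-file`; a dict works for tests.
cfg = load_settings({"FRAUD_THRESHOLD": "0.35", "LOG_LEVEL": "debug"})
```

Keeping the threshold in configuration rather than code lets operators retune the precision/recall trade-off without rebuilding the image.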

Features

  • Prediction API: FastAPI endpoints for single & batch scoring
  • Interactive Docs: Swagger UI (/docs) and ReDoc (/redoc)
  • Model Artifacts: joblib-exported model and scalers
  • Threshold Tuning: calibrated decision threshold for fraud class
  • Validation: Pydantic schemas with robust error responses
  • Containerization: Docker image for easy run/deploy
  • CI: optional GitHub Actions for lint/build/test

Tools & Technologies

Python 3.10, scikit-learn, pandas, numpy, joblib, FastAPI, uvicorn, pydantic, Docker, Git, GitHub, Jupyter Notebook