metadata
tags:
  - gaussian-mixture
  - imbalance
  - classification
  - scikit-learn
  - streamlit
library_name: scikit-learn
datasets:
  - breast-cancer
  - creditcard
  - adult
language: en
license: mit

🔍 Gaussian Mixture Model (GMM) for Imbalanced Classification

This project implements a Gaussian Mixture Model (GMM) classifier for extremely imbalanced classification problems. It simulates real-world imbalance scenarios and benchmarks the classifier on three public datasets.


🧠 Problem Statement

Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) have very few minority-class examples. Classical ML methods often fail on them because their decision boundaries are biased toward the majority class.

This project demonstrates how GMM-based generative classifiers, combined with explicit imbalance handling (e.g., undersampling), can improve minority-class detection, especially in low-data regimes.
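As a rough illustration of the idea, a per-class GMM classifier can be sketched as below. This is a minimal sketch, not the code in gmm_classifier.py: the class name `PerClassGMM`, its parameters, and the choice of diagonal covariances are assumptions made for the example. It fits one `GaussianMixture` per class and predicts the class whose mixture gives the higher log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class PerClassGMM:
    """Hypothetical generative classifier: one GaussianMixture per class."""

    def __init__(self, n_components=2, use_priors=False, random_state=0):
        self.n_components = n_components
        self.use_priors = use_priors
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_, self.log_priors_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            # Cap the component count so a small minority class can still be fitted.
            k = max(1, min(self.n_components, len(Xc) // 2))
            gmm = GaussianMixture(
                n_components=k,
                covariance_type="diag",  # assumption: fewer parameters for scarce data
                reg_covar=1e-3,
                random_state=self.random_state,
            )
            self.models_.append(gmm.fit(Xc))
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def predict(self, X):
        # Per-sample log-likelihood under each class model: (n_samples, n_classes).
        scores = np.column_stack([m.score_samples(X) for m in self.models_])
        if self.use_priors:
            # Adding log priors gives the Bayes rule; it favours the majority class.
            scores = scores + np.asarray(self.log_priors_)
        return self.classes_[np.argmax(scores, axis=1)]
```

With `use_priors=False` the two classes are compared on likelihood alone, so the class frequencies do not pull the decision toward the majority class.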


🧪 Datasets Used

  1. Breast Cancer Wisconsin Dataset (sklearn.datasets.load_breast_cancer)
  2. Credit Card Fraud Detection (OpenML 42175)
  3. Adult Income Dataset (OpenML 1590)
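The three sources above can be read with scikit-learn alone. The sketch below is an assumption about how a loader might look, not the contents of data_loader.py: the function name `load_dataset` and the dictionary `OPENML_IDS` are hypothetical, and the dataset keys are taken from the metadata tags.

```python
from sklearn.datasets import fetch_openml, load_breast_cancer

# OpenML IDs as given in the list above.
OPENML_IDS = {"creditcard": 42175, "adult": 1590}


def load_dataset(name):
    """Return (X, y) for one of the three benchmark datasets (hypothetical helper)."""
    if name == "breast-cancer":
        return load_breast_cancer(return_X_y=True)
    if name in OPENML_IDS:
        # Downloads on first call, then reads from the local scikit-learn cache.
        data = fetch_openml(data_id=OPENML_IDS[name], as_frame=True)
        return data.data, data.target
    raise ValueError(f"unknown dataset: {name!r}")
```

The Adult dataset contains categorical columns, so it would need encoding before a GMM can be fitted to it.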

📊 Key Features

  • 🔍 GMM classifier per class
  • ⚖️ Controlled imbalance sampling
  • 📊 Evaluation: F1-macro, balanced accuracy
  • 🧪 Multi-dataset benchmark
  • 🚀 Hugging Face integration for model sharing
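Controlled imbalance sampling can be sketched with NumPy alone. The example below is an assumption about the approach, not the function in imbalance_sampler.py: the name `make_imbalanced` and its parameters are hypothetical, and it assumes the minority class is the one being thinned so that it ends up at a chosen fraction of the data.

```python
import numpy as np


def make_imbalanced(X, y, minority_label=1, minority_ratio=0.05, seed=0):
    """Undersample one class so it makes up about `minority_ratio` of the result."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Solve n_min / (n_min + n_maj) = ratio for n_min, keeping every majority row.
    n_min = int(round(minority_ratio * len(majority_idx) / (1.0 - minority_ratio)))
    n_min = max(1, min(n_min, len(minority_idx)))
    chosen = rng.choice(minority_idx, size=n_min, replace=False)
    keep = np.concatenate([majority_idx, chosen])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

Fixing `seed` keeps the simulated imbalance reproducible across benchmark runs.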

🚀 Usage

🔧 Install dependencies

pip install -r requirements.txt

▢️ Run benchmark

python benchmark.py

📈 Output

  • Classification report
  • Confusion matrix
  • Balanced accuracy
  • F1-score (macro)
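All four outputs are available in `sklearn.metrics`. A minimal sketch of how they might be collected follows; the function name `evaluate` and the dictionary keys are assumptions for illustration and may differ from what evaluate.py does.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)


def evaluate(y_true, y_pred):
    """Collect the four outputs listed above into one dictionary (hypothetical helper)."""
    return {
        "report": classification_report(y_true, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```

Balanced accuracy and macro F1 weight each class equally, which is why they are more informative than plain accuracy when one class is rare.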

📂 Project Structure

gmm-minority-classification/
├── gmm_classifier.py        # GMM model logic
├── data_loader.py           # Dataset loaders (3 total)
├── imbalance_sampler.py     # Undersampling function
├── benchmark.py             # Multi-dataset test harness
├── evaluate.py              # Metric evaluation functions
├── push_to_huggingface.py   # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore

📚 Research & Citations

This project builds on the following research:

GMM & Probabilistic Models

  1. Dempster et al. (1977): Maximum likelihood via EM
  2. Bishop, C. (2006): Pattern Recognition and Machine Learning
  3. McLachlan & Peel (2000): Finite Mixture Models
  4. Reynolds et al. (2009): Gaussian Mixture Modeling for Classification
  5. Bouveyron et al. (2007): High-dimensional GMM classification

Imbalanced Classification

  1. Chawla et al. (2002): SMOTE
  2. He & Garcia (2009): Learning from Imbalanced Data
  3. Japkowicz (2000): The Class Imbalance Problem: A Historical Perspective
  4. Buda et al. (2018): A systematic study of class imbalance
  5. Liu et al. (2009): EasyEnsemble and BalanceCascade

Evaluation Metrics

  1. Sokolova & Lapalme (2009): A systematic analysis of performance measures
  2. Van Rijsbergen (1979): Information Retrieval (F-measure origin)

Dataset Papers

  1. Dua & Graff (2019): UCI Machine Learning Repository
  2. Lichman (2013): Adult Dataset
  3. Dal Pozzolo et al. (2015): Credit Card Fraud Dataset

Recent Works & Variants

  1. Loquercio et al. (2020): Generative Models for Anomaly Detection
  2. Roy et al. (2022): GMM on Tabular Data
  3. Fuchs et al. (2023): Robust GMM Variants
  4. Ren et al. (2023): Mixture of Experts for Class Imbalance
  5. Guo et al. (2021): Bayesian GMMs in Skewed Data
  6. Cao et al. (2021): Confidence-aware GMMs
  7. Wang et al. (2023): Deep Mixture Models for Rare Class Learning
  8. Han et al. (2022): Label Noise and GMM
  9. Kim et al. (2022): Hybrid GMM for Multi-Class Tabular Data
  10. Cortes et al. (2025): Margin-aware Mixture Models

🤗 Push to Hugging Face

To publish the trained GMM:

huggingface-cli login
python push_to_huggingface.py

You can also create the model repository from the command line:

huggingface-cli repo create gmm-imbalance-model --type=model

🙌 Authors

Saurav Singla
📬 github.com/sauravsingla



📦 Pretrained GMM Model

We provide a pretrained Gaussian Mixture Model as gmm_pretrained_model.pkl inside this repository.

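🔧 Load Model in Python

from joblib import load

# Load the bundle from a local file
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]  # GMM for class 0
model_1 = model_bundle["model_1"]  # GMM for class 1

# Predict: X_new is your feature matrix of shape (n_samples, n_features)
X_scaled = scaler.transform(X_new)
score_0 = model_0.score_samples(X_scaled)  # per-sample log-likelihood under class 0
score_1 = model_1.score_samples(X_scaled)  # per-sample log-likelihood under class 1
y_pred = (score_1 > score_0).astype(int)   # choose the class with the higher score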
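For reference, a bundle with the same three keys could be produced roughly as follows. This is a sketch under assumptions, not the script that produced the shipped file: the function name `train_and_save_bundle` is hypothetical, and the use of `StandardScaler` and two components per class is a guess. Only the key names `scaler`, `model_0`, and `model_1` come from the loading snippet above.

```python
import numpy as np
from joblib import dump
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def train_and_save_bundle(X, y, path="gmm_pretrained_model.pkl", n_components=2, seed=0):
    """Fit a scaler plus one GMM per class and save them under the keys used above."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scaler = StandardScaler().fit(X)  # assumption: the bundled scaler is a StandardScaler
    X_scaled = scaler.transform(X)
    bundle = {"scaler": scaler}
    for label in (0, 1):
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        # Each mixture sees only the scaled rows of its own class.
        bundle[f"model_{label}"] = gmm.fit(X_scaled[y == label])
    dump(bundle, path)
    return bundle
```

Because joblib files are pickles, load a bundle only from a source you trust.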

🌐 Load from Hugging Face

from huggingface_hub import hf_hub_download
from joblib import load

# Download the bundle from the Hub, then load it with joblib
model_path = hf_hub_download(repo_id="YOUR_USERNAME/gmm-imbalance-model", filename="gmm_pretrained_model.pkl")
model_bundle = load(model_path)