metadata

tags:
  - gaussian-mixture
  - imbalance
  - classification
  - scikit-learn
  - streamlit
library_name: scikit-learn
datasets:
  - breast-cancer
  - creditcard
  - adult
language: en
license: mit

🔍 Gaussian Mixture Model (GMM) for Imbalanced Classification

This project implements a classifier based on Gaussian Mixture Models (GMMs) for severely imbalanced classification problems. It simulates class imbalance at controlled ratios and benchmarks the classifier on three public datasets.

🧠 Problem Statement

Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) have very few examples of the minority class. Classifiers trained on such data often learn decision boundaries biased toward the majority class and miss minority cases.

This project demonstrates how GMM-based generative classifiers, which fit one density model per class, can improve minority class detection when combined with imbalance handling (e.g., undersampling), especially in low-data regimes.
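
The generative approach described above can be sketched in a few lines. This is a minimal illustration, not the contents of gmm_classifier.py: the class name `PerClassGMM`, the diagonal covariance, the `reg_covar` value, and the optional prior term are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class PerClassGMM:
    """Fit one GaussianMixture per class; predict the class with the highest score."""

    def __init__(self, n_components=2, use_priors=False, random_state=0):
        self.n_components = n_components
        self.use_priors = use_priors
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.models_, self.log_priors_ = [], []
        for c in self.classes_:
            X_c = X[y == c]
            gmm = GaussianMixture(
                # never ask for more components than there are samples
                n_components=min(self.n_components, len(X_c)),
                covariance_type="diag",  # assumption: fewer parameters for tiny classes
                reg_covar=1e-3,
                random_state=self.random_state,
            )
            self.models_.append(gmm.fit(X_c))
            self.log_priors_.append(np.log(len(X_c) / len(X)))
        return self

    def predict(self, X):
        # score_samples gives the per-sample log-likelihood under each class model
        scores = np.column_stack([m.score_samples(np.asarray(X)) for m in self.models_])
        if self.use_priors:
            scores = scores + np.array(self.log_priors_)
        return self.classes_[np.argmax(scores, axis=1)]
```

With `use_priors=False` the rule compares likelihoods only, so the minority class is not penalized for being rare. Setting `use_priors=True` adds the log class frequencies, which shifts the decision toward the majority class.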

🧪 Datasets Used

Breast Cancer Wisconsin Dataset (sklearn.datasets.load_breast_cancer)
Credit Card Fraud Detection (OpenML 42175)
Adult Income Dataset (OpenML 1590)
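
As a rough sketch, the three sources listed above could be loaded through scikit-learn as shown below. The function name `load_dataset` and the short dataset keys are hypothetical and may not match data_loader.py. The OpenML IDs are the ones given in the list.

```python
from sklearn.datasets import fetch_openml, load_breast_cancer


def load_dataset(name):
    """Return (X, y) for one of the three benchmark datasets."""
    if name == "breast-cancer":
        # bundled with scikit-learn, no download needed
        return load_breast_cancer(return_X_y=True)
    openml_ids = {"creditcard": 42175, "adult": 1590}
    # fetch_openml downloads on first use and caches the result locally
    data = fetch_openml(data_id=openml_ids[name], as_frame=True)
    return data.data, data.target
```

The Adult dataset contains categorical columns, so it would need encoding (for example one-hot) before a GMM can be fitted to it.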

📊 Key Features

🔁 GMM classifier per class
⚖️ Controlled imbalance sampling
📊 Evaluation: F1-macro, balanced accuracy
🧪 Multi-dataset benchmark
🚀 Hugging Face integration for model sharing
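
To illustrate what controlled imbalance sampling might look like, the sketch below keeps every majority sample and subsamples the minority class until it makes up a chosen fraction of the data. The function name `make_imbalanced` and its parameters are assumptions for this example and are not taken from imbalance_sampler.py.

```python
import numpy as np


def make_imbalanced(X, y, minority_label=1, minority_fraction=0.05, random_state=0):
    """Keep all majority samples and subsample the minority to a target fraction."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Solve n_min / (n_min + n_maj) = minority_fraction for n_min
    n_min = int(round(minority_fraction * len(majority_idx) / (1 - minority_fraction)))
    # keep at least one minority sample and never more than are available
    n_min = max(1, min(n_min, len(minority_idx)))
    keep = np.concatenate([majority_idx, rng.choice(minority_idx, n_min, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

Calling it with several values of `minority_fraction` gives a series of training sets with increasing imbalance drawn from the same source data.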

🚀 Usage

🔧 Install dependencies

pip install -r requirements.txt

▶️ Run benchmark

python benchmark.py

📈 Output

Classification report
Confusion matrix
Balanced accuracy
F1-score (macro)
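
The four outputs listed above are all available in `sklearn.metrics`. A minimal sketch of how they could be gathered follows. The helper name `evaluate` is hypothetical and may differ from what evaluate.py actually defines.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)


def evaluate(y_true, y_pred):
    """Collect the report, confusion matrix, balanced accuracy, and macro F1."""
    return {
        "report": classification_report(y_true, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```

Balanced accuracy and macro F1 both average over classes rather than over samples, so a model that always predicts the majority class scores poorly on them even when plain accuracy looks high.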

📂 Project Structure

gmm-minority-classification/
├── gmm_classifier.py        # GMM model logic
├── data_loader.py           # Dataset loaders (3 total)
├── imbalance_sampler.py     # Undersampling function
├── benchmark.py             # Multi-dataset test harness
├── evaluate.py              # Metric evaluation functions
├── push_to_huggingface.py   # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore

📚 Research & Citations

We built this based on the following key research works:

GMM & Probabilistic Models

Dempster et al. (1977) — Maximum likelihood via EM
Bishop, C. (2006) — Pattern Recognition and Machine Learning
McLachlan & Peel (2000) — Finite Mixture Models
Reynolds et al. (2009) — Gaussian Mixture Modeling for Classification
Bouveyron et al. (2007) — High-dimensional GMM classification

Imbalanced Classification

Chawla et al. (2002) — SMOTE
He & Garcia (2009) — Learning from Imbalanced Data
Japkowicz (2000) — The Class Imbalance Problem: A Historical Perspective
Buda et al. (2018) — A systematic study of class imbalance
Liu et al. (2009) — EasyEnsemble and BalanceCascade

Evaluation Metrics

Sokolova & Lapalme (2009) — A systematic analysis of performance measures
Van Rijsbergen (1979) — Information Retrieval (F-measure origin)

Dataset Papers

Dua & Graff (2019) — UCI Machine Learning Repository
Lichman (2013) — Adult Dataset
Dal Pozzolo et al. (2015) — Credit Card Fraud Dataset

Recent Works & Variants

Loquercio et al. (2020) — Generative Models for Anomaly Detection
Roy et al. (2022) — GMM on Tabular Data
Fuchs et al. (2023) — Robust GMM Variants
Ren et al. (2023) — Mixture of Experts for Class Imbalance
Guo et al. (2021) — Bayesian GMMs in Skewed Data
Cao et al. (2021) — Confidence-aware GMMs
Wang et al. (2023) — Deep Mixture Models for Rare Class Learning
Han et al. (2022) — Label Noise and GMM
Kim et al. (2022) — Hybrid GMM for Multi-Class Tabular Data
Cortes et al. (2025) — Margin-aware Mixture Models

🤗 Push to Hugging Face

To publish the trained GMM:

huggingface-cli login
python push_to_huggingface.py

You can also create the model repository from the CLI:

huggingface-cli repo create gmm-imbalance-model --type=model
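
The same steps can be done from Python with the `huggingface_hub` client. The sketch below is one possible approach and is not necessarily what push_to_huggingface.py does. The function name `push_model` is hypothetical, and it assumes you have already logged in with `huggingface-cli login`.

```python
def push_model(local_path, repo_id, filename="gmm_pretrained_model.pkl"):
    """Create the model repo if needed and upload one file to it."""
    # Imported here so the function can be defined without the package installed
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the token stored by `huggingface-cli login`
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    return api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=filename,
        repo_id=repo_id,
        repo_type="model",
    )
```

For example, `push_model("gmm_pretrained_model.pkl", "YOUR_USERNAME/gmm-imbalance-model")` would upload the local file under the same name.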

🙌 Authors

Saurav Singla
📬 github.com/sauravsingla

📦 Pretrained GMM Model

This repository includes a pretrained model bundle, gmm_pretrained_model.pkl, which contains a fitted scaler and one GMM per class.

🔧 Load Model in Python

from joblib import load

# Load from local file
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]
model_1 = model_bundle["model_1"]

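# Predict: X_new is an array of new samples with the same features used in training
X_scaled = scaler.transform(X_new)
# score_samples returns the log-likelihood of each sample under a class model
score_0 = model_0.score_samples(X_scaled)
score_1 = model_1.score_samples(X_scaled)
# Assign class 1 wherever its model explains the sample better
y_pred = (score_1 > score_0).astype(int)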
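
The loading code above expects a dictionary with the keys `scaler`, `model_0`, and `model_1`. A bundle of that shape could be produced roughly as follows. The function name `build_bundle` and the hyperparameters are assumptions for illustration, not the settings used for the shipped file.

```python
import numpy as np
from joblib import dump
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def build_bundle(X, y, n_components=1, random_state=0, path=None):
    """Fit a scaler and one GMM per class; optionally save the bundle to disk."""
    X, y = np.asarray(X), np.asarray(y)
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)
    bundle = {"scaler": scaler}
    for label in (0, 1):
        gmm = GaussianMixture(
            n_components=n_components,
            reg_covar=1e-3,  # assumption: extra regularization for small classes
            random_state=random_state,
        )
        bundle[f"model_{label}"] = gmm.fit(X_scaled[y == label])
    if path is not None:
        dump(bundle, path)
    return bundle
```

Passing `path="gmm_pretrained_model.pkl"` writes a file that the loading snippet can read back with `joblib.load`.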

🌐 Load from Hugging Face

from huggingface_hub import hf_hub_download
from joblib import load

model_path = hf_hub_download(repo_id="YOUR_USERNAME/gmm-imbalance-model", filename="gmm_pretrained_model.pkl")
model_bundle = load(model_path)