|
--- |
|
tags: |
|
- gaussian-mixture |
|
- imbalance |
|
- classification |
|
- scikit-learn |
|
- streamlit |
|
library_name: scikit-learn |
|
datasets: |
|
- breast-cancer |
|
- creditcard |
|
- adult |
|
language: en |
|
license: mit |
|
--- |
|
|
|
# Gaussian Mixture Model (GMM) for Imbalanced Classification
|
|
|
This project implements a Gaussian Mixture Model (GMM)-based classifier designed to handle **extremely imbalanced classification problems**. It simulates real-world imbalance scenarios and benchmarks the approach on three public datasets.
|
|
|
--- |
|
|
|
## Problem Statement
|
|
|
Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) suffer from **minority class scarcity**. Classical ML methods often fail due to biased decision boundaries. |
|
|
|
This project demonstrates how **GMM-based generative classifiers**, when combined with intelligent **imbalance handling** (e.g., undersampling), can improve minority class detection, especially in low-data regimes.
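As a minimal sketch of the idea on synthetic data (illustrative only, not the repository's actual `gmm_classifier.py` API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

# Synthetic 99:1 imbalanced data stands in for a real dataset
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.99, 0.01], random_state=0)

# Undersample the majority class before fitting its density model
rng = np.random.default_rng(0)
maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)
keep = rng.choice(maj_idx, size=min(len(maj_idx), len(min_idx) * 5), replace=False)

# One GMM per class; predict the class with the higher log-likelihood
gmm_0 = GaussianMixture(n_components=1, random_state=0).fit(X[keep])
gmm_1 = GaussianMixture(n_components=1, random_state=0).fit(X[min_idx])
y_pred = (gmm_1.score_samples(X) > gmm_0.score_samples(X)).astype(int)
```

With a single component per class this is essentially quadratic discriminant analysis without class priors; raising `n_components` lets each class density be multimodal.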
|
|
|
--- |
|
|
|
## Datasets Used
|
|
|
1. **Breast Cancer Wisconsin Dataset** (`sklearn.datasets.load_breast_cancer`) |
|
2. **Credit Card Fraud Detection** ([OpenML 42175](https://www.openml.org/d/42175)) |
|
3. **Adult Income Dataset** ([OpenML 1590](https://www.openml.org/d/1590)) |
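All three can be pulled through scikit-learn; for example (the OpenML fetches download on first use, so they are left commented out here):

```python
from sklearn.datasets import load_breast_cancer, fetch_openml

# Breast cancer ships with scikit-learn
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# The two OpenML datasets are fetched by ID and cached locally
# creditcard = fetch_openml(data_id=42175, as_frame=True)
# adult = fetch_openml(data_id=1590, as_frame=True)
```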
|
|
|
--- |
|
|
|
## Key Features
|
|
|
- GMM classifier per class

- Controlled imbalance sampling

- Evaluation: F1-macro, balanced accuracy

- Multi-dataset benchmark

- Hugging Face integration for model sharing
|
|
|
--- |
|
|
|
## Usage
|
|
|
### Install dependencies
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### Run benchmark
|
|
|
```bash |
|
python benchmark.py |
|
``` |
|
|
|
### Output
|
|
|
- Classification report |
|
- Confusion matrix |
|
- Balanced accuracy |
|
- F1-score (macro) |
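These outputs are standard scikit-learn metric calls; a toy example on an imbalanced label vector:

```python
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix, f1_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1]   # 6:2 imbalance
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows = true class, cols = predicted
print(classification_report(y_true, y_pred))
bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean per-class recall
f1_macro = f1_score(y_true, y_pred, average="macro")
```

Balanced accuracy and macro F1 weight both classes equally, so a classifier that ignores the minority class scores poorly even when plain accuracy looks high.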
|
|
|
--- |
|
|
|
## Project Structure
|
|
|
```bash
gmm-minority-classification/
├── gmm_classifier.py        # GMM model logic
├── data_loader.py           # Dataset loaders (3 total)
├── imbalance_sampler.py     # Undersampling function
├── benchmark.py             # Multi-dataset test harness
├── evaluate.py              # Metric evaluation functions
├── push_to_huggingface.py   # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore
```
|
|
|
--- |
|
|
|
## Research & Citations
|
|
|
This project builds on the following research:
|
|
|
### GMM & Probabilistic Models |
|
1. Dempster et al. (1977) – Maximum likelihood via EM

2. Bishop, C. (2006) – *Pattern Recognition and Machine Learning*

3. McLachlan & Peel (2000) – *Finite Mixture Models*

4. Reynolds et al. (2009) – Gaussian Mixture Modeling for Classification

5. Bouveyron et al. (2007) – High-dimensional GMM classification
|
|
|
### Imbalanced Classification |
|
6. Chawla et al. (2002) – SMOTE

7. He & Garcia (2009) – Learning from Imbalanced Data

8. Japkowicz (2000) – The Class Imbalance Problem: A Historical Perspective

9. Buda et al. (2018) – A systematic study of class imbalance

10. Liu et al. (2009) – EasyEnsemble and BalanceCascade
|
|
|
### Evaluation Metrics |
|
11. Sokolova & Lapalme (2009) – A systematic analysis of performance measures

12. Van Rijsbergen (1979) – Information Retrieval (F-measure origin)
|
|
|
### Dataset Papers |
|
13. Dua & Graff (2019) – UCI Machine Learning Repository

14. Lichman (2013) – Adult Dataset

15. Dal Pozzolo et al. (2015) – Credit Card Fraud Dataset
|
|
|
### Recent Works & Variants |
|
16. Loquercio et al. (2020) – Generative Models for Anomaly Detection

17. Roy et al. (2022) – GMM on Tabular Data

18. Fuchs et al. (2023) – Robust GMM Variants

19. Ren et al. (2023) – Mixture of Experts for Class Imbalance

20. Guo et al. (2021) – Bayesian GMMs in Skewed Data

21. Cao et al. (2021) – Confidence-aware GMMs

22. Wang et al. (2023) – Deep Mixture Models for Rare Class Learning

23. Han et al. (2022) – Label Noise and GMM

24. Kim et al. (2022) – Hybrid GMM for Multi-Class Tabular Data

25. Cortes et al. (2025) – Margin-aware Mixture Models
|
|
|
--- |
|
|
|
## Push to Hugging Face
|
|
|
To publish the trained GMM: |
|
|
|
```bash |
|
huggingface-cli login |
|
python push_to_huggingface.py |
|
``` |
|
|
|
To create the target repository on the Hub first:
|
```bash |
|
huggingface-cli repo create gmm-imbalance-model --type=model |
|
``` |
|
|
|
--- |
|
|
|
## Author
|
|
|
**Saurav Singla** |
|
[github.com/sauravsingla](https://github.com/sauravsingla)
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
## Pretrained GMM Model
|
|
|
We provide a pretrained Gaussian Mixture Model as `gmm_pretrained_model.pkl` inside this repository. |
|
|
|
### Load Model in Python
|
|
|
```python
from joblib import load

# Load from local file
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]  # GMM for class 0
model_1 = model_bundle["model_1"]  # GMM for class 1

# Predict: assign the class whose GMM gives the higher log-likelihood.
# X_new is a 2-D array with the same feature columns used in training.
X_scaled = scaler.transform(X_new)
score_0 = model_0.score_samples(X_scaled)
score_1 = model_1.score_samples(X_scaled)
y_pred = (score_1 > score_0).astype(int)
```
|
|
|
### Load from Hugging Face
|
|
|
```python
from joblib import load
from huggingface_hub import hf_hub_download

# Replace YOUR_USERNAME with your Hub namespace
model_path = hf_hub_download(
    repo_id="YOUR_USERNAME/gmm-imbalance-model",
    filename="gmm_pretrained_model.pkl",
)
model_bundle = load(model_path)
```
|
|
|
--- |
|
|
|
|