---
tags:
- gaussian-mixture
- imbalance
- classification
- scikit-learn
- streamlit
library_name: scikit-learn
datasets:
- breast-cancer
- creditcard
- adult
language: en
license: mit
---
# 🔍 Gaussian Mixture Model (GMM) for Imbalanced Classification
This project implements a Gaussian Mixture Model (GMM)-based classifier designed to handle **extremely imbalanced classification problems**. It simulates real-world imbalance scenarios and benchmarks the classifier on three public datasets.
---
## 🧠 Problem Statement
Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) suffer from **minority class scarcity**. Classical ML methods often fail due to biased decision boundaries.
This project demonstrates how **GMM-based generative classifiers**, when combined with intelligent **imbalance handling** (e.g., undersampling), can improve minority class detection β€” especially in low-data regimes.
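
As a rough illustration of the generative approach described above, the sketch below fits one `GaussianMixture` per class and predicts the class whose model gives the higher log-likelihood. The class name `PerClassGMM`, the number of components, the diagonal covariance, and the synthetic data are all assumptions made for illustration; none of it is taken from `gmm_classifier.py`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


class PerClassGMM:
    """Hypothetical per-class GMM classifier (illustrative, not the repo's code)."""

    def __init__(self, n_components=2, random_state=0):
        self.n_components = n_components
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        for c in self.classes_:
            X_c = X[y == c]
            # Do not request more mixture components than the class has samples.
            k = min(self.n_components, len(X_c))
            gmm = GaussianMixture(
                n_components=k,
                covariance_type="diag",
                reg_covar=1e-3,
                random_state=self.random_state,
            )
            self.models_[c] = gmm.fit(X_c)
        return self

    def predict(self, X):
        # Log-likelihood log p(x | class), one column per class (equal priors assumed).
        scores = np.column_stack(
            [self.models_[c].score_samples(X) for c in self.classes_]
        )
        return self.classes_[np.argmax(scores, axis=1)]


# Synthetic 20:1 imbalanced data: 500 majority points, 25 minority points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(500, 2)),
    rng.normal(loc=4.0, scale=1.0, size=(25, 2)),
])
y = np.array([0] * 500 + [1] * 25)

clf = PerClassGMM(n_components=2).fit(X, y)
y_pred = clf.predict(X)
minority_recall = (y_pred[y == 1] == 1).mean()
print(f"minority recall on training data: {minority_recall:.2f}")
```

Because each class gets its own density model, the minority model is fitted only on minority samples and is not outvoted by the majority class during training. The sketch ignores class priors, which is a design choice rather than a requirement.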
---
## 🧪 Datasets Used
1. **Breast Cancer Wisconsin Dataset** (`sklearn.datasets.load_breast_cancer`)
2. **Credit Card Fraud Detection** ([OpenML 42175](https://www.openml.org/d/42175))
3. **Adult Income Dataset** ([OpenML 1590](https://www.openml.org/d/1590))
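
The datasets listed above can be fetched with scikit-learn's own loaders. The following is a minimal sketch; the helper names `load_breast_cancer_xy` and `load_openml_xy` are hypothetical and may differ from what `data_loader.py` actually does. The OpenML helper needs network access, and which column it returns as the target depends on the OpenML metadata for that dataset.

```python
from sklearn.datasets import fetch_openml, load_breast_cancer


def load_breast_cancer_xy():
    """Return features and binary labels from the dataset bundled with scikit-learn."""
    data = load_breast_cancer()
    return data.data, data.target


def load_openml_xy(data_id):
    """Download a dataset from OpenML by numeric ID (requires network access)."""
    bunch = fetch_openml(data_id=data_id, as_frame=True)
    return bunch.data, bunch.target


X, y = load_breast_cancer_xy()
print(X.shape)  # (569, 30)

# The two OpenML datasets would be fetched like this (not run here):
# X_fraud, y_fraud = load_openml_xy(42175)
# X_adult, y_adult = load_openml_xy(1590)
```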
---
## 📊 Key Features
- 🔍 GMM classifier per class
- ⚖️ Controlled imbalance sampling
- 📊 Evaluation: F1-macro, balanced accuracy
- 🧪 Multi-dataset benchmark
- 🚀 Hugging Face integration for model sharing
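
One way to picture the controlled imbalance sampling listed above is a helper that keeps only a fixed fraction of one class. This is a sketch under assumptions: the function name `undersample_class` and its parameters are hypothetical and are not taken from `imbalance_sampler.py`.

```python
import numpy as np


def undersample_class(X, y, label, keep_fraction, seed=0):
    """Keep only `keep_fraction` of the rows whose label equals `label`."""
    rng = np.random.default_rng(seed)
    idx_target = np.flatnonzero(y == label)
    idx_other = np.flatnonzero(y != label)
    # Always keep at least one sample of the undersampled class.
    n_keep = max(1, int(round(keep_fraction * len(idx_target))))
    kept = rng.choice(idx_target, size=n_keep, replace=False)
    # Sort so the surviving rows stay in their original order.
    idx = np.sort(np.concatenate([idx_other, kept]))
    return X[idx], y[idx]


# Balanced toy data: 500 samples per class.
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.array([0] * 500 + [1] * 500)

# Keep 5% of class 1, turning a 1:1 problem into a 20:1 problem.
X_imb, y_imb = undersample_class(X, y, label=1, keep_fraction=0.05)
counts = np.bincount(y_imb)
print(counts)
```

Fixing the seed keeps the simulated imbalance reproducible across benchmark runs.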
---
## 🚀 Usage
### 🔧 Install dependencies
```bash
pip install -r requirements.txt
```
### ▢️ Run benchmark
```bash
python benchmark.py
```
### 📈 Output
- Classification report
- Confusion matrix
- Balanced accuracy
- F1-score (macro)
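
The four outputs listed above all map onto functions in `sklearn.metrics`. The sketch below shows how they could be computed; the helper name `summarize_metrics` and the toy labels are assumptions for illustration, not the contents of `evaluate.py`.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)


def summarize_metrics(y_true, y_pred):
    """Collect the four benchmark outputs in one dictionary."""
    return {
        "report": classification_report(y_true, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }


# Toy example: 8 majority samples, 2 minority samples, 2 mistakes in total.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

results = summarize_metrics(y_true, y_pred)
print(results["report"])
print(results["confusion_matrix"])   # [[7 1], [1 1]]
print(results["balanced_accuracy"])  # 0.6875
print(results["f1_macro"])           # 0.6875
```

In this toy case plain accuracy is 0.8, while balanced accuracy and macro F1 are both 0.6875, which is why the benchmark reports the latter two for imbalanced data.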
---
## 📂 Project Structure
```text
gmm-minority-classification/
├── gmm_classifier.py        # GMM model logic
├── data_loader.py           # Dataset loaders (3 total)
├── imbalance_sampler.py     # Undersampling function
├── benchmark.py             # Multi-dataset test harness
├── evaluate.py              # Metric evaluation functions
├── push_to_huggingface.py   # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore
```
---
## 📚 Research & Citations
We built this based on the following key research works:
### GMM & Probabilistic Models
1. Dempster et al. (1977): Maximum likelihood via EM
2. Bishop, C. (2006): *Pattern Recognition and Machine Learning*
3. McLachlan & Peel (2000): *Finite Mixture Models*
4. Reynolds et al. (2009): Gaussian Mixture Modeling for Classification
5. Bouveyron et al. (2007): High-dimensional GMM classification
### Imbalanced Classification
6. Chawla et al. (2002): SMOTE
7. He & Garcia (2009): Learning from Imbalanced Data
8. Japkowicz (2000): The Class Imbalance Problem: A Historical Perspective
9. Buda et al. (2018): A systematic study of class imbalance
10. Liu et al. (2009): EasyEnsemble and BalanceCascade
### Evaluation Metrics
11. Sokolova & Lapalme (2009): A systematic analysis of performance measures
12. Van Rijsbergen (1979): Information Retrieval (F-measure origin)
### Dataset Papers
13. Dua & Graff (2019): UCI Machine Learning Repository
14. Lichman (2013): Adult Dataset
15. Dal Pozzolo et al. (2015): Credit Card Fraud Dataset
### Recent Works & Variants
16. Loquercio et al. (2020): Generative Models for Anomaly Detection
17. Roy et al. (2022): GMM on Tabular Data
18. Fuchs et al. (2023): Robust GMM Variants
19. Ren et al. (2023): Mixture of Experts for Class Imbalance
20. Guo et al. (2021): Bayesian GMMs in Skewed Data
21. Cao et al. (2021): Confidence-aware GMMs
22. Wang et al. (2023): Deep Mixture Models for Rare Class Learning
23. Han et al. (2022): Label Noise and GMM
24. Kim et al. (2022): Hybrid GMM for Multi-Class Tabular Data
25. Cortes et al. (2025): Margin-aware Mixture Models
---
## 🤗 Push to Hugging Face
To publish the trained GMM:
```bash
huggingface-cli login
python push_to_huggingface.py
```
If the model repository does not exist yet, you can create it beforehand with:
```bash
huggingface-cli repo create gmm-imbalance-model --type=model
```
---
## 🙌 Authors
**Saurav Singla**
📬 [github.com/sauravsingla](https://github.com/sauravsingla)
---
## 📦 Pretrained GMM Model
We provide a pretrained Gaussian Mixture Model as `gmm_pretrained_model.pkl` inside this repository.
### 🔧 Load Model in Python
```python
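from joblib import load

# Load the bundle from a local file (a dict with the scaler and one GMM per class)
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]
model_1 = model_bundle["model_1"]

# X_new is your own 2D array of shape (n_samples, n_features),
# with the same feature columns the scaler was fitted on
X_scaled = scaler.transform(X_new)

# Per-sample log-likelihood under each class model
score_0 = model_0.score_samples(X_scaled)
score_1 = model_1.score_samples(X_scaled)

# Predict class 1 wherever its model assigns the higher log-likelihood
y_pred = (score_1 > score_0).astype(int)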
```
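
The loading code above expects a dictionary with the keys `scaler`, `model_0`, and `model_1`. How the shipped file was produced is not documented here, so the following is only a sketch of how a compatible bundle could be built and saved. The `StandardScaler`, the two-component diagonal GMMs, the breast cancer data, and the file name `gmm_bundle_demo.pkl` are all assumptions.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_breast_cancer
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fit the scaler once and reuse it at prediction time.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# One GMM per class, stored under the keys the loading code expects.
bundle = {"scaler": scaler}
for label in (0, 1):
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
    bundle[f"model_{label}"] = gmm.fit(X_scaled[y == label])

# Save to a temporary directory (demo file name, not the shipped model).
path = os.path.join(tempfile.mkdtemp(), "gmm_bundle_demo.pkl")
dump(bundle, path)

# Round trip: reload and predict with the same rule as above.
restored = load(path)
scores = np.column_stack([
    restored["model_0"].score_samples(X_scaled),
    restored["model_1"].score_samples(X_scaled),
])
y_pred = scores.argmax(axis=1)
print(f"training accuracy: {(y_pred == y).mean():.3f}")
```

Files written with `joblib.dump` are pickles, so only load bundles from sources you trust, and preferably with the same scikit-learn version that was used to save them.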
### 🌐 Load from Hugging Face
```python
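from huggingface_hub import hf_hub_download
from joblib import load

# Download the pickled bundle from the Hub (replace YOUR_USERNAME), then load it
model_path = hf_hub_download(repo_id="YOUR_USERNAME/gmm-imbalance-model", filename="gmm_pretrained_model.pkl")
model_bundle = load(model_path)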
```
---