|
--- |
|
tags: |
|
- gaussian-mixture |
|
- imbalance |
|
- classification |
|
- scikit-learn |
|
- streamlit |
|
library_name: scikit-learn |
|
datasets: |
|
- breast-cancer |
|
- creditcard |
|
- adult |
|
language: en |
|
license: mit |
|
--- |
|
|
|
# Gaussian Mixture Model (GMM) for Imbalanced Classification
|
|
|
This project implements a Gaussian Mixture Model (GMM)-based classifier designed to handle **extremely imbalanced classification problems**. It simulates real-world imbalance scenarios and benchmarks the approach on three public datasets.
|
|
|
--- |
|
|
|
## Problem Statement
|
|
|
Many real-world classification tasks (e.g., fraud detection, rare disease diagnosis) suffer from **minority class scarcity**. Classical ML methods often fail due to biased decision boundaries. |
|
|
|
This project demonstrates how **GMM-based generative classifiers**, when combined with intelligent **imbalance handling** (e.g., undersampling), can improve minority class detection, especially in low-data regimes.
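As a minimal sketch of the idea on synthetic data (illustrative only, not the repository's actual `gmm_classifier.py` API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

# Synthetic 99:1 imbalanced data stands in for a real dataset
X, y = make_classification(n_samples=2000, n_features=6,
                           weights=[0.99, 0.01], random_state=0)

# Undersample the majority class before fitting its density model
rng = np.random.default_rng(0)
maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)
keep = rng.choice(maj_idx, size=min(len(maj_idx), len(min_idx) * 5), replace=False)

# One GMM per class; predict the class with the higher log-likelihood
gmm_0 = GaussianMixture(n_components=1, random_state=0).fit(X[keep])
gmm_1 = GaussianMixture(n_components=1, random_state=0).fit(X[min_idx])
y_pred = (gmm_1.score_samples(X) > gmm_0.score_samples(X)).astype(int)
```

With a single component per class this is essentially quadratic discriminant analysis without class priors; raising `n_components` lets each class density be multimodal.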
|
|
|
--- |
|
|
|
## Datasets Used
|
|
|
1. **Breast Cancer Wisconsin Dataset** (`sklearn.datasets.load_breast_cancer`) |
|
2. **Credit Card Fraud Detection** ([OpenML 42175](https://www.openml.org/d/42175)) |
|
3. **Adult Income Dataset** ([OpenML 1590](https://www.openml.org/d/1590)) |
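All three can be pulled through scikit-learn; for example (the OpenML fetches download on first use, so they are left commented out here):

```python
from sklearn.datasets import load_breast_cancer, fetch_openml

# Breast cancer ships with scikit-learn
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# The two OpenML datasets are fetched by ID and cached locally
# creditcard = fetch_openml(data_id=42175, as_frame=True)
# adult = fetch_openml(data_id=1590, as_frame=True)
```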
|
|
|
--- |
|
|
|
## Key Features
|
|
|
- GMM classifier per class

- Controlled imbalance sampling

- Evaluation: F1-macro, balanced accuracy

- Multi-dataset benchmark

- Hugging Face integration for model sharing
|
|
|
--- |
|
|
|
## Usage
|
|
|
### Install dependencies
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### Run benchmark
|
|
|
```bash |
|
python benchmark.py |
|
``` |
|
|
|
### Output
|
|
|
- Classification report |
|
- Confusion matrix |
|
- Balanced accuracy |
|
- F1-score (macro) |
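These outputs are standard scikit-learn metric calls; a toy example on an imbalanced label vector:

```python
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix, f1_score)

y_true = [0, 0, 0, 0, 0, 0, 1, 1]   # 6:2 imbalance
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))        # rows = true class, cols = predicted
print(classification_report(y_true, y_pred))
bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean per-class recall
f1_macro = f1_score(y_true, y_pred, average="macro")
```

Balanced accuracy and macro F1 weight both classes equally, so a classifier that ignores the minority class scores poorly even when plain accuracy looks high.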
|
|
|
--- |
|
|
|
## Project Structure
|
|
|
```bash
gmm-minority-classification/
├── gmm_classifier.py        # GMM model logic
├── data_loader.py           # Dataset loaders (3 total)
├── imbalance_sampler.py     # Undersampling function
├── benchmark.py             # Multi-dataset test harness
├── evaluate.py              # Metric evaluation functions
├── push_to_huggingface.py   # Upload model to HF hub
├── requirements.txt
├── README.md
└── .gitignore
```
|
|
|
--- |
|
|
|
## Research & Citations
|
|
|
This project builds on the following research:
|
|
|
### GMM & Probabilistic Models |
|
1. Dempster et al. (1977) – Maximum likelihood via EM

2. Bishop, C. (2006) – *Pattern Recognition and Machine Learning*

3. McLachlan & Peel (2000) – *Finite Mixture Models*

4. Reynolds et al. (2009) – Gaussian Mixture Modeling for Classification

5. Bouveyron et al. (2007) – High-dimensional GMM classification
|
|
|
### Imbalanced Classification |
|
6. Chawla et al. (2002) – SMOTE

7. He & Garcia (2009) – Learning from Imbalanced Data

8. Japkowicz (2000) – The Class Imbalance Problem: A Historical Perspective

9. Buda et al. (2018) – A systematic study of class imbalance

10. Liu et al. (2009) – EasyEnsemble and BalanceCascade
|
|
|
### Evaluation Metrics |
|
11. Sokolova & Lapalme (2009) – A systematic analysis of performance measures

12. Van Rijsbergen (1979) – Information Retrieval (F-measure origin)
|
|
|
### Dataset Papers |
|
13. Dua & Graff (2019) – UCI Machine Learning Repository

14. Lichman (2013) – Adult Dataset

15. Dal Pozzolo et al. (2015) – Credit Card Fraud Dataset
|
|
|
### Recent Works & Variants |
|
16. Loquercio et al. (2020) – Generative Models for Anomaly Detection

17. Roy et al. (2022) – GMM on Tabular Data

18. Fuchs et al. (2023) – Robust GMM Variants

19. Ren et al. (2023) – Mixture of Experts for Class Imbalance

20. Guo et al. (2021) – Bayesian GMMs in Skewed Data

21. Cao et al. (2021) – Confidence-aware GMMs

22. Wang et al. (2023) – Deep Mixture Models for Rare Class Learning

23. Han et al. (2022) – Label Noise and GMM

24. Kim et al. (2022) – Hybrid GMM for Multi-Class Tabular Data

25. Cortes et al. (2025) – Margin-aware Mixture Models
|
|
|
--- |
|
|
|
## Push to Hugging Face
|
|
|
To publish the trained GMM: |
|
|
|
```bash |
|
huggingface-cli login |
|
python push_to_huggingface.py |
|
``` |
|
|
|
To create the target repository on the Hub first:
|
```bash |
|
huggingface-cli repo create gmm-imbalance-model --type=model |
|
``` |
|
|
|
--- |
|
|
|
## Author
|
|
|
**Saurav Singla** |
|
[github.com/sauravsingla](https://github.com/sauravsingla)
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
## Pretrained GMM Model
|
|
|
We provide a pretrained Gaussian Mixture Model as `gmm_pretrained_model.pkl` inside this repository. |
|
|
|
### Load Model in Python
|
|
|
```python
from joblib import load

# Load from local file
model_bundle = load("gmm_pretrained_model.pkl")
scaler = model_bundle["scaler"]
model_0 = model_bundle["model_0"]  # GMM for class 0
model_1 = model_bundle["model_1"]  # GMM for class 1

# Predict: assign the class whose GMM gives the higher log-likelihood.
# X_new is a 2-D array with the same feature columns used in training.
X_scaled = scaler.transform(X_new)
score_0 = model_0.score_samples(X_scaled)
score_1 = model_1.score_samples(X_scaled)
y_pred = (score_1 > score_0).astype(int)
```
|
|
|
### Load from Hugging Face
|
|
|
```python
from joblib import load
from huggingface_hub import hf_hub_download

# Replace YOUR_USERNAME with your Hub namespace
model_path = hf_hub_download(
    repo_id="YOUR_USERNAME/gmm-imbalance-model",
    filename="gmm_pretrained_model.pkl",
)
model_bundle = load(model_path)
```
|
|
|
--- |
|
|
|
|