# Insurance Fraud Prediction Model

This project builds and evaluates a machine learning model to detect fraudulent insurance claims. It covers data preprocessing, model training with a RandomForestClassifier, model evaluation with various metrics and visualizations, and a Streamlit UI for interacting with the model.
### Installation

Create and activate a virtual environment:

```bash
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```

Install the required packages:

```bash
pip install -r requirements.txt
```
### Project Structure

```
insurance-fraud-detection/
│
├── dataset/
│   └── insurance_claims.csv
│
├── model/
│   └── only_model.joblib
│
├── train.py
├── prediction.py
├── app.py
├── requirements.txt
└── README.md
```
### Data Preprocessing

#### Data Loading

The data is loaded from a CSV file located at `dataset/insurance_claims.csv`. During loading, the following steps are performed:

- Drop the `_c39` column.
- Replace `'?'` with `NaN`.
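
A minimal sketch of this loading step, assuming pandas (the exact code in `train.py` may differ):

```python
import numpy as np
import pandas as pd

# Load the raw claims data and drop the unused _c39 column.
df = pd.read_csv("dataset/insurance_claims.csv")
df = df.drop(columns=["_c39"])

# The dataset marks missing values with '?'; normalize them to NaN.
df = df.replace("?", np.nan)
```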
#### Data Cleaning

- Fill missing values in the `property_damage`, `police_report_available`, and `collision_type` columns with their mode.
- Drop duplicate records.
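
A sketch of the cleaning step, continuing from the `df` loaded above:

```python
# Fill the sparsely populated categorical columns with their mode.
for col in ["property_damage", "police_report_available", "collision_type"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Remove exact duplicate records.
df = df.drop_duplicates()
```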
#### Encoding and Feature Selection

- Encode categorical variables using label encoding.
- Drop columns that are not relevant to the model.
- Select the final set of features listed below.
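
The encoding step might look like the following sketch, which applies scikit-learn's `LabelEncoder` to every object-typed column (an assumption; `train.py` may encode a narrower set):

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode each categorical (object-typed) column in place.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```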
#### Preprocessed Features

The final set of features used for model training:

- `incident_severity`
- `insured_hobbies`
- `total_claim_amount`
- `months_as_customer`
- `policy_annual_premium`
- `incident_date`
- `capital-loss`
- `capital-gains`
- `insured_education_level`
- `incident_city`
- `fraud_reported` (target variable)
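
Selecting these columns could then look like this sketch; `FEATURES` and `TARGET` are illustrative names, not identifiers from the source:

```python
# Final feature set and target, as listed above.
FEATURES = [
    "incident_severity", "insured_hobbies", "total_claim_amount",
    "months_as_customer", "policy_annual_premium", "incident_date",
    "capital-loss", "capital-gains", "insured_education_level",
    "incident_city",
]
TARGET = "fraud_reported"

X, y = df[FEATURES], df[TARGET]
```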
### Model Training

The model is trained using a RandomForestClassifier with a pipeline that includes preprocessing steps and hyperparameter tuning via GridSearchCV.
#### Training Steps

1. **Train-test split:** The data is split into training and testing sets with a 70/30 split.
2. **Pipeline setup:** A pipeline is created to combine preprocessing and model training.
3. **Hyperparameter tuning:** A grid search is performed to find the best hyperparameters.
4. **Model training:** The best model is fit on the training data.
5. **Model saving:** The trained model is saved as `fraud_insurance_pipeline.joblib`.

A sketch of these steps follows the list.
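
The sketch below uses scikit-learn; the preprocessing step (a `StandardScaler`) and the hyperparameter grid are assumptions for illustration, not the exact settings in `train.py`:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 70/30 train-test split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Pipeline: preprocessing (assumed scaler) followed by the classifier.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Illustrative hyperparameter grid (assumed values).
param_grid = {
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [None, 10, 20],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Persist the best pipeline.
joblib.dump(search.best_estimator_, "fraud_insurance_pipeline.joblib")
```

Since `GridSearchCV` refits with the best hyperparameters by default, `search.best_estimator_` is already retrained on the full training set before saving.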
### Model Evaluation

The trained model is evaluated on the test set. The evaluation metrics include:

- **Classification Report:** precision, recall, and F1-score.
- **AUC Score:** area under the ROC curve.
- **Confusion Matrix:** visual comparison of true vs. predicted labels.
- **ROC Curve:** receiver operating characteristic curve.
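
A minimal evaluation sketch, continuing from the fitted `search` above:

```python
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    classification_report,
    roc_auc_score,
)

model = search.best_estimator_
y_pred = model.predict(X_test)

# Precision, recall, and F1-score per class.
print(classification_report(y_test, y_pred))

# AUC uses the predicted probability of the positive class.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Confusion matrix and ROC curve plots.
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
RocCurveDisplay.from_estimator(model, X_test, y_test)
```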
### Usage

#### Training the Model

To train the model, run:

```bash
python train.py
```

#### Evaluating the Model

To evaluate the model, run:

```bash
python prediction.py
```
#### Running the Streamlit App

To launch the Streamlit app, run:

```bash
streamlit run app.py
```