---
title: VQA
emoji: 🚀
colorFrom: gray
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: VQA API Endpoint
---

Check out the configuration reference at
https://huggingface.co/docs/hub/spaces-config-reference

# VizWiz Visual Question Answering API

This repository contains a FastAPI backend for a Visual Question Answering (VQA)
system trained on the VizWiz dataset.

## Features

- Upload images and ask questions about them
- Get answers with confidence scores
- Session management for asking multiple questions about the same image
- Health check endpoint for monitoring
- API documentation with Swagger UI

## Project Structure

```
project_root/
├── app/
│   ├── main.py                    # Main FastAPI application
│   ├── models/                    # Model definitions
│   │   ├── __init__.py
│   │   └── vqa_model.py           # VQA model implementation
│   ├── routers/                   # API route definitions
│   │   ├── __init__.py
│   │   └── vqa.py                 # VQA-related endpoints
│   ├── services/                  # Business logic
│   │   ├── __init__.py
│   │   ├── model_service.py       # Model loading and inference
│   │   └── session_service.py     # Session management
│   ├── utils/                     # Utility functions
│   │   ├── __init__.py
│   │   └── image_utils.py         # Image processing utilities
│   └── config.py                  # Application configuration
├── models/                        # Directory for model files
├── uploads/                       # Directory for uploaded images
├── .env                           # Environment variables
└── requirements.txt               # Project dependencies
```

## Installation

1. Clone the repository:

```bash
git clone https://github.com/dixisouls/vizwiz-vqa-api.git
cd vizwiz-vqa-api
```

2. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Create necessary directories:

```bash
mkdir -p models uploads
```

5. Place your trained model in the `models` directory (by default the application
   looks for `./models/vqa_model_best.pt`; see `MODEL_PATH` under Environment
   Variables).

6. Update the `.env` file with your configuration; the available settings are
   listed in the Environment Variables section below.

## Running the Application

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at http://localhost:8000.

API documentation is available at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

## API Endpoints

### Health Check

```
GET /health
```

Returns the health status of the API.
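
For example, from Python with `requests` (assuming the server from the Running
the Application section is up on port 8000; the exact response payload depends
on the implementation):

```python
import requests

# Ping the health endpoint of a locally running instance.
response = requests.get("http://localhost:8000/health")
response.raise_for_status()
print(response.json())  # status payload returned by the API
```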

### Upload Image

```
POST /api/vqa/upload
```

Upload an image and create a new session.
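
A minimal sketch of the upload step with Python's `requests`; the multipart
field name (`file`) and the `session_id` response key are assumptions, since
the exact schema is defined in `app/routers/vqa.py`:

```python
import requests

BASE_URL = "http://localhost:8000"

# Send an image as multipart form data; the field name "file" is assumed.
with open("example.jpg", "rb") as image_file:
    response = requests.post(
        f"{BASE_URL}/api/vqa/upload",
        files={"file": ("example.jpg", image_file, "image/jpeg")},
    )
response.raise_for_status()
session_id = response.json()["session_id"]  # assumed response key
print(f"Created session: {session_id}")
```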

### Ask Question

```
POST /api/vqa/ask
```

Ask a question about an uploaded image.
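
Continuing the sketch above, a question can then be asked against that session;
the request fields (`session_id`, `question`) and the shape of the
answer/confidence response are assumptions:

```python
import requests

BASE_URL = "http://localhost:8000"
session_id = "your-session-id"  # returned by the upload endpoint above

# Ask a question about the previously uploaded image (field names are assumed).
response = requests.post(
    f"{BASE_URL}/api/vqa/ask",
    json={"session_id": session_id, "question": "What color is the mug?"},
)
response.raise_for_status()
print(response.json())  # expected to contain the answer and a confidence score
```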

### Get Session

```
GET /api/vqa/session/{session_id}
```

Get session information including question history.

### Reset Session

```
DELETE /api/vqa/session/{session_id}
```

Reset a session to start fresh.
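
A sketch of inspecting and then resetting a session using the two endpoints
above (response shapes are assumptions):

```python
import requests

BASE_URL = "http://localhost:8000"
session_id = "your-session-id"  # returned by the upload endpoint

# Retrieve session information, including the question history.
info = requests.get(f"{BASE_URL}/api/vqa/session/{session_id}")
info.raise_for_status()
print(info.json())

# Reset the session to start fresh.
reset = requests.delete(f"{BASE_URL}/api/vqa/session/{session_id}")
reset.raise_for_status()
```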

## Environment Variables

- `DEBUG`: Enable debug mode (default: False)
- `MODEL_PATH`: Path to the trained model (default: ./models/vqa_model_best.pt)
- `TEXT_MODEL`: Name of the text model (default: bert-base-uncased)
- `VISION_MODEL`: Name of the vision model (default:
  google/vit-base-patch16-384)
- `HUGGINGFACE_TOKEN`: Hugging Face API token
- `UPLOAD_DIR`: Directory for uploaded images (default: ./uploads)
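
For illustration, a `.env` built from the defaults above might look like the
following (the token value is a placeholder):

```
DEBUG=False
MODEL_PATH=./models/vqa_model_best.pt
TEXT_MODEL=bert-base-uncased
VISION_MODEL=google/vit-base-patch16-384
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxx
UPLOAD_DIR=./uploads
```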

## License

[MIT License](LICENSE)