---
title: Multimodal Sentiment Analysis
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.48.1"
app_file: app.py
pinned: false
---

# Multimodal Sentiment Analysis

A Streamlit application that combines three sentiment analysis models: text, audio, and vision. The project demonstrates how to integrate multiple AI models for sentiment understanding across different modalities.

![Demo GIF](https://github.com/user-attachments/assets/ac6ed8dc-e225-44a8-a6f1-c2d6b318adf4)

## What is it?

This project implements a **fused sentiment analysis system** that combines predictions from three independent models:

### 1. Text Sentiment Analysis

- **Model**: TextBlob NLP library
- **Capability**: Analyzes text input for positive, negative, or neutral sentiment
- **Status**: ✅ Fully integrated and ready to use
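
TextBlob reports a polarity score in [-1.0, 1.0]; a minimal sketch of how such a score can be bucketed into the three sentiment labels (the 0.1 cutoff is an illustrative assumption, not necessarily the app's actual threshold):

```python
# Map a TextBlob-style polarity score in [-1.0, 1.0] to a sentiment label.
# The 0.1 cutoff is an illustrative assumption, not the app's actual value.
def polarity_to_label(polarity: float, cutoff: float = 0.1) -> str:
    if polarity > cutoff:
        return "positive"
    if polarity < -cutoff:
        return "negative"
    return "neutral"
```

With TextBlob itself, the score would come from `TextBlob(text).sentiment.polarity`.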

### 2. Audio Sentiment Analysis

- **Model**: Fine-tuned Wav2Vec2-base model
- **Training Data**: RAVDESS + CREMA-D emotional speech datasets
- **Capability**: Analyzes audio files and microphone recordings for sentiment
- **Features**:
  - File upload support (WAV, MP3, M4A, FLAC)
  - Direct microphone recording (max 5 seconds)
  - Automatic preprocessing (16kHz sampling, 5s max duration)
- **Status**: ✅ Fully integrated and ready to use
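
The duration-limiting step of the 16 kHz / 5 s preprocessing can be sketched in pure Python on a list of samples; the real pipeline would use librosa or torchaudio for loading and resampling (the function name here is illustrative):

```python
# Enforce the app's audio constraints: 16 kHz sampling, 5 s maximum duration.
# Pure-Python sketch over a sample list; resampling itself would be done by
# librosa/torchaudio in the actual pipeline.
SAMPLE_RATE = 16_000
MAX_SECONDS = 5

def limit_duration(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Truncate a waveform to at most max_seconds of audio."""
    return samples[: sample_rate * max_seconds]
```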

### 3. Vision Sentiment Analysis

- **Model**: Fine-tuned ResNet-50 model
- **Training Data**: FER2013 facial expression dataset
- **Capability**: Analyzes images for facial expression-based sentiment
- **Features**:
  - File upload support (PNG, JPG, JPEG, BMP, TIFF)
  - Camera capture functionality
  - Automatic face detection and preprocessing
  - Grayscale conversion and 224x224 resize
- **Status**: ✅ Fully integrated and ready to use
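
The grayscale-conversion step can be sketched per pixel with the standard ITU-R BT.601 luma weights; the actual app likely relies on OpenCV (`cv2.cvtColor`) and additionally handles face detection and the 224x224 resize:

```python
# Grayscale conversion step of the vision pipeline, sketched per pixel.
# Standard BT.601 luma weights; the real app would use OpenCV for this plus
# face detection and resizing.
def to_grayscale(r: int, g: int, b: int) -> int:
    return round(0.299 * r + 0.587 * g + 0.114 * b)
```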

### 4. Fused Model

- **Approach**: Combines predictions from all three models
- **Capability**: Provides comprehensive sentiment analysis across modalities
- **Status**: ✅ Fully integrated and ready to use
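
The README does not specify the combination rule, but one common approach is to average per-class probabilities across the models and take the argmax; a hedged sketch under that assumption:

```python
# One plausible fusion strategy: average per-class probabilities from the
# three models and pick the argmax. The actual combination rule used by the
# app is not specified in this README, so treat this as illustrative.
def fuse_predictions(*model_probs):
    classes = model_probs[0].keys()
    fused = {c: sum(p[c] for p in model_probs) / len(model_probs) for c in classes}
    return max(fused, key=fused.get), fused

label, scores = fuse_predictions(
    {"positive": 0.7, "negative": 0.1, "neutral": 0.2},  # text
    {"positive": 0.5, "negative": 0.3, "neutral": 0.2},  # audio
    {"positive": 0.6, "negative": 0.2, "neutral": 0.2},  # vision
)
```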

### 5. 🎬 Max Fusion

- **Approach**: Video-based comprehensive sentiment analysis
- **Capability**: Analyzes 5-second videos by extracting frames, audio, and transcribing speech
- **Features**:
  - Video recording or file upload (MP4, AVI, MOV, MKV, WMV, FLV)
  - Automatic frame extraction for vision analysis
  - Audio extraction for vocal sentiment analysis
  - Speech-to-text transcription for text sentiment analysis
  - Combined results from all three modalities
- **Status**: ✅ Fully integrated and ready to use

## Project Structure

```
sentiment-fused/
├── app.py                          # Main Streamlit application
├── simple_model_manager.py         # Model management and Google Drive integration
├── requirements.txt                # Python dependencies
├── pyproject.toml                  # Project configuration
├── Dockerfile                      # Container deployment
├── notebooks/                      # Development notebooks
│   ├── audio_sentiment_analysis.ipynb    # Audio model development
│   └── vision_sentiment_analysis.ipynb   # Vision model development
├── model_weights/                  # Model storage directory (downloaded .pth files)
└── src/                            # Source code package
    ├── __init__.py                 # Package initialization
    ├── config/                     # Configuration settings
    ├── models/                     # Model logic and inference code
    ├── utils/                      # Utility functions and preprocessing
    └── ui/                         # User interface components
```

## Key Features

- **Real-time Analysis**: Instant sentiment predictions with confidence scores
- **Smart Preprocessing**: Automatic file format handling and preprocessing
- **Multi-Page Interface**: Clean navigation between different sentiment analysis modes
- **Model Management**: Automatic model downloading from Google Drive
- **File Support**: Multiple audio and image format support
- **Camera & Microphone**: Direct input capture capabilities

## Prerequisites

- Python 3.9 or higher
- 4GB+ RAM (for model loading)
- Internet connection (for initial model download)

## Installation

1. **Clone the repository**:

   ```bash
   git clone <your-repo-url>
   cd sentiment-fused
   ```

2. **Create a virtual environment** (recommended):

   ```bash
   python -m venv venv

   # On Windows
   venv\Scripts\activate

   # On macOS/Linux
   source venv/bin/activate
   ```

3. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

4. **Set up environment variables**:
   Create a `.env` file in the project root with:
   ```env
   VISION_MODEL_DRIVE_ID=your_google_drive_vision_model_file_id_here
   AUDIO_MODEL_DRIVE_ID=your_google_drive_audio_model_file_id_here
   VISION_MODEL_FILENAME=resnet50_model.pth
   AUDIO_MODEL_FILENAME=wav2vec2_model.pth
   ```
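
A minimal sketch of how the app might read these variables at runtime, using only the standard library (variable names come from the `.env` file above; the fallback filenames mirror its defaults):

```python
import os

# Read model configuration from the environment. In the actual app,
# python-dotenv (or similar) would first load the .env file into os.environ.
def load_model_config():
    return {
        "vision_drive_id": os.getenv("VISION_MODEL_DRIVE_ID"),
        "audio_drive_id": os.getenv("AUDIO_MODEL_DRIVE_ID"),
        "vision_filename": os.getenv("VISION_MODEL_FILENAME", "resnet50_model.pth"),
        "audio_filename": os.getenv("AUDIO_MODEL_FILENAME", "wav2vec2_model.pth"),
    }
```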

## Running Locally

1. **Start the Streamlit application**:

   ```bash
   streamlit run app.py
   ```

2. **Open your browser** and navigate to the URL shown in the terminal (usually `http://localhost:8501`)

3. **Navigate between pages** using the sidebar:
   - 🏠 **Home**: Overview and welcome page
   - 📝 **Text Sentiment**: Analyze text with TextBlob
   - 🎵 **Audio Sentiment**: Analyze audio files or record with microphone
   - 🖼️ **Vision Sentiment**: Analyze images or capture with camera
   - 🔗 **Fused Model**: Combine all three models
   - 🎬 **Max Fusion**: Video-based comprehensive analysis

## Model Development

The project includes Jupyter notebooks that document the development process:

### Audio Model (`notebooks/audio_sentiment_analysis.ipynb`)

- Wav2Vec2-base fine-tuning on RAVDESS + CREMA-D datasets
- Emotion-to-sentiment mapping (happy/surprised → positive, sad/angry/fearful/disgust → negative, neutral/calm → neutral)
- Audio preprocessing pipeline (16kHz sampling, 5s max duration)

### Vision Model (`notebooks/vision_sentiment_analysis.ipynb`)

- ResNet-50 fine-tuning on FER2013 dataset
- Emotion-to-sentiment mapping (happy/surprise → positive, angry/disgust/fear/sad → negative, neutral → neutral)
- Image preprocessing pipeline (face detection, grayscale conversion, 224x224 resize)
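
The emotion-to-sentiment mappings described in the two notebooks can be written out as simple lookup tables:

```python
# Emotion-to-sentiment mappings from the notebooks
# (audio: RAVDESS/CREMA-D labels; vision: FER2013 labels).
AUDIO_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprised": "positive",
    "sad": "negative", "angry": "negative",
    "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
}

VISION_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprise": "positive",
    "angry": "negative", "disgust": "negative",
    "fear": "negative", "sad": "negative",
    "neutral": "neutral",
}
```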

## Technical Implementation

### Model Management

- `SimpleModelManager` class handles model downloading from Google Drive
- Automatic model caching and version management
- Environment variable configuration for model URLs

### Preprocessing Pipelines

- **Audio**: Automatic resampling, duration limiting, feature extraction
- **Vision**: Face detection, cropping, grayscale conversion, normalization
- **Text**: Direct TextBlob processing

### Streamlit Integration

- Multi-page application with sidebar navigation
- File upload widgets with format validation
- Real-time camera and microphone input
- Custom CSS styling for modern UI

## Deployment

### Docker Deployment

```bash
# Build the container
docker build -t sentiment-fused .

# Run the container
docker run -p 7860:7860 sentiment-fused
```

The application will be available at `http://localhost:7860`

### Local Development

```bash
# Run with custom port
streamlit run app.py --server.port 8502

# Run with custom address
streamlit run app.py --server.address 0.0.0.0
```

## Troubleshooting

### Common Issues

1. **Model Loading Errors**:

   - Ensure environment variables are set correctly
   - Check internet connection for model downloads
   - Verify sufficient RAM (4GB+ recommended)

2. **Dependency Issues**:

   - Use virtual environment to avoid conflicts
   - Install PyTorch with CUDA support if using GPU
   - Ensure OpenCV is properly installed for face detection

3. **Performance Issues**:
   - Large audio/image files may cause memory issues
   - Consider file size limits for better performance
   - GPU acceleration available for PyTorch models

### Model Testing

```bash
# Test vision model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Vision model:', m.load_vision_model()[0] is not None)"

# Test audio model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Audio model:', m.load_audio_model()[0] is not None)"
```

## Dependencies

Key libraries used:

- **Streamlit**: Web application framework
- **PyTorch**: Deep learning framework
- **Transformers**: Hugging Face model library
- **OpenCV**: Computer vision and face detection
- **Librosa**: Audio processing
- **TextBlob**: Natural language processing
- **Gdown**: Google Drive file downloader
- **MoviePy**: Video processing and audio extraction
- **SpeechRecognition**: Audio transcription

## What This Project Demonstrates

1. **Multimodal AI Integration**: Combining text, audio, and vision models
2. **Model Management**: Automated downloading and caching of pre-trained models
3. **Real-time Processing**: Live audio recording and camera capture
4. **Smart Preprocessing**: Automatic format conversion and optimization
5. **Modern Web UI**: Professional Streamlit application with custom styling
6. **Production Ready**: Docker containerization and deployment
7. **Video Analysis**: Comprehensive video processing with multi-modal extraction
8. **Speech Recognition**: Audio-to-text transcription for enhanced analysis
9. **Modular Architecture**: Clean, maintainable code structure with separated concerns
10. **Professional Code Organization**: Proper Python packaging with config, models, utils, and UI modules

## Recent Improvements

The project has been refactored from a monolithic structure to a clean, modular architecture:

- **Modular Design**: Separated into logical modules (`src/config/`, `src/models/`, `src/utils/`, `src/ui/`)
- **Centralized Configuration**: All settings consolidated in `src/config/settings.py`
- **Clean Separation**: Model logic, preprocessing, and UI components are now in dedicated modules
- **Better Maintainability**: Easier to modify, test, and extend individual components
- **Professional Structure**: Follows Python packaging best practices

This project serves as a comprehensive example of building production-ready multimodal AI applications with modern Python tools and frameworks.