---
title: Multimodal Sentiment Analysis
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.48.1"
app_file: app.py
pinned: false
---
# Multimodal Sentiment Analysis
A Streamlit application that combines three sentiment analysis models: text, audio, and vision. The project demonstrates how to integrate multiple AI models for sentiment understanding across different modalities.

## What is it?
This project implements a **fused sentiment analysis system** that combines predictions from three independent models:
### 1. Text Sentiment Analysis
- **Model**: TextBlob NLP library
- **Capability**: Analyzes text input for positive, negative, or neutral sentiment
- **Status**: ✅ Fully integrated and ready to use
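For illustration, a minimal sketch of the text path using TextBlob's polarity score (the ±0.1 cutoffs are an assumption, not necessarily the app's exact thresholds):
```python
from textblob import TextBlob

def text_sentiment(text: str) -> tuple[str, float]:
    # TextBlob polarity ranges from -1.0 (negative) to +1.0 (positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:    # assumed cutoff for "positive"
        return "positive", polarity
    if polarity < -0.1:   # assumed cutoff for "negative"
        return "negative", polarity
    return "neutral", polarity

print(text_sentiment("I love this project!"))  # e.g. ('positive', 0.5)
```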
### 2. Audio Sentiment Analysis
- **Model**: Fine-tuned Wav2Vec2-base model
- **Training Data**: RAVDESS + CREMA-D emotional speech datasets
- **Capability**: Analyzes audio files and microphone recordings for sentiment
- **Features**:
- File upload support (WAV, MP3, M4A, FLAC)
- Direct microphone recording (max 5 seconds)
- Automatic preprocessing (16kHz sampling, 5s max duration)
- **Status**: ✅ Fully integrated and ready to use
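A minimal sketch of the audio preprocessing described above (16 kHz resampling, 5 s truncation), feeding a Wav2Vec2 feature extractor; the base checkpoint name is illustrative, since the app loads its own fine-tuned weights:
```python
import librosa
from transformers import Wav2Vec2FeatureExtractor

TARGET_SR = 16_000   # Wav2Vec2 expects 16 kHz mono input
MAX_SECONDS = 5      # matches the 5 s training window

def preprocess_audio(path: str):
    # librosa resamples and downmixes to mono in one call
    waveform, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    waveform = waveform[: TARGET_SR * MAX_SECONDS]  # truncate to 5 s
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    return extractor(waveform, sampling_rate=TARGET_SR, return_tensors="pt")
```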
### 3. Vision Sentiment Analysis
- **Model**: Fine-tuned ResNet-50 model
- **Training Data**: FER2013 facial expression dataset
- **Capability**: Analyzes images for facial expression-based sentiment
- **Features**:
- File upload support (PNG, JPG, JPEG, BMP, TIFF)
- Camera capture functionality
- Automatic face detection and preprocessing
- Grayscale conversion and 224x224 resize
- **Status**: ✅ Fully integrated and ready to use
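A minimal sketch of the vision preprocessing (face detection, grayscale conversion, 224x224 resize), assuming OpenCV's bundled Haar cascade; the detector parameters are illustrative:
```python
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def preprocess_face(image_path: str):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; the caller decides the fallback
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    return cv2.resize(gray[y:y + h, x:x + w], (224, 224))
```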
### 4. Fused Model
- **Approach**: Combines predictions from all three models
- **Capability**: Provides comprehensive sentiment analysis across modalities
- **Status**: ✅ Fully integrated and ready to use
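The README does not pin down the fusion rule, so the sketch below assumes a simple average of per-label confidences across the three models:
```python
def fuse_predictions(*predictions: dict) -> tuple[str, float]:
    """Average per-label confidences across modalities and pick the argmax.

    Each prediction is a dict like {"positive": 0.7, "negative": 0.1, "neutral": 0.2}.
    """
    labels = ("positive", "negative", "neutral")
    fused = {
        label: sum(p.get(label, 0.0) for p in predictions) / len(predictions)
        for label in labels
    }
    best = max(fused, key=fused.get)
    return best, fused[best]

# Example: text leans positive, audio is ambivalent, vision leans positive
print(fuse_predictions(
    {"positive": 0.7, "negative": 0.1, "neutral": 0.2},
    {"positive": 0.3, "negative": 0.2, "neutral": 0.5},
    {"positive": 0.6, "negative": 0.3, "neutral": 0.1},
))  # ('positive', 0.53...)
```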
### 5. 🎬 Max Fusion
- **Approach**: Video-based comprehensive sentiment analysis
- **Capability**: Analyzes 5-second videos by extracting frames, audio, and transcribing speech
- **Features**:
- Video recording or file upload (MP4, AVI, MOV, MKV, WMV, FLV)
- Automatic frame extraction for vision analysis
- Audio extraction for vocal sentiment analysis
- Speech-to-text transcription for text sentiment analysis
- Combined results from all three modalities
- **Status**: ✅ Fully integrated and ready to use
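A minimal sketch of splitting a clip into the three inputs, assuming the MoviePy 1.x import path and Google's free recognizer from the SpeechRecognition package; grabbing only the middle frame is an illustrative simplification of the app's frame extraction:
```python
import moviepy.editor as mp
import speech_recognition as sr

def split_video(video_path: str, wav_path: str = "extracted_audio.wav"):
    """Pull a representative frame, the audio track, and a transcript from a clip."""
    clip = mp.VideoFileClip(video_path)

    # Audio track for the vocal sentiment model (16 kHz to match Wav2Vec2)
    clip.audio.write_audiofile(wav_path, fps=16000)

    # A representative frame for the vision model (RGB numpy array)
    frame = clip.get_frame(clip.duration / 2)

    # Transcript for the text sentiment model
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        transcript = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        transcript = ""  # no intelligible speech detected

    clip.close()
    return frame, wav_path, transcript
```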
## Project Structure
```
sentiment-fused/
├── app.py                    # Main Streamlit application
├── simple_model_manager.py   # Model management and Google Drive integration
├── requirements.txt          # Python dependencies
├── pyproject.toml            # Project configuration
├── Dockerfile                # Container deployment
├── notebooks/                # Development notebooks
│   ├── audio_sentiment_analysis.ipynb    # Audio model development
│   └── vision_sentiment_analysis.ipynb   # Vision model development
├── model_weights/            # Model storage directory (downloaded .pth files)
└── src/                      # Source code package
    ├── __init__.py           # Package initialization
    ├── config/               # Configuration settings
    ├── models/               # Model logic and inference code
    ├── utils/                # Utility functions and preprocessing
    └── ui/                   # User interface components
```
## Key Features
- **Real-time Analysis**: Instant sentiment predictions with confidence scores
- **Smart Preprocessing**: Automatic file format handling and preprocessing
- **Multi-Page Interface**: Clean navigation between different sentiment analysis modes
- **Model Management**: Automatic model downloading from Google Drive
- **File Support**: Multiple audio and image format support
- **Camera & Microphone**: Direct input capture capabilities
## Prerequisites
- Python 3.9 or higher
- 4GB+ RAM (for model loading)
- Internet connection (for initial model download)
## Installation
1. **Clone the repository**:
```bash
git clone <your-repo-url>
cd sentiment-fused
```
2. **Create a virtual environment** (recommended):
```bash
python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Set up environment variables**:
Create a `.env` file in the project root with:
```env
VISION_MODEL_DRIVE_ID=your_google_drive_vision_model_file_id_here
AUDIO_MODEL_DRIVE_ID=your_google_drive_audio_model_file_id_here
VISION_MODEL_FILENAME=resnet50_model.pth
AUDIO_MODEL_FILENAME=wav2vec2_model.pth
```
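These IDs are read from the environment at startup. A minimal sketch of loading them, assuming python-dotenv (the project may instead read `os.environ` directly):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # pull .env from the project root into the process environment

VISION_MODEL_DRIVE_ID = os.getenv("VISION_MODEL_DRIVE_ID")
AUDIO_MODEL_DRIVE_ID = os.getenv("AUDIO_MODEL_DRIVE_ID")
```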
## Running Locally
1. **Start the Streamlit application**:
```bash
streamlit run app.py
```
2. **Open your browser** and navigate to the URL shown in the terminal (usually `http://localhost:8501`)
3. **Navigate between pages** using the sidebar:
- 🏠 **Home**: Overview and welcome page
- 📝 **Text Sentiment**: Analyze text with TextBlob
- 🎵 **Audio Sentiment**: Analyze audio files or record with microphone
- 🖼️ **Vision Sentiment**: Analyze images or capture with camera
- 🔗 **Fused Model**: Combine all three models
- 🎬 **Max Fusion**: Video-based comprehensive analysis
## Model Development
The project includes Jupyter notebooks that document the development process:
### Audio Model (`notebooks/audio_sentiment_analysis.ipynb`)
- Wav2Vec2-base fine-tuning on RAVDESS + CREMA-D datasets
- Emotion-to-sentiment mapping (happy/surprised → positive, sad/angry/fearful/disgust → negative, neutral/calm → neutral)
- Audio preprocessing pipeline (16kHz sampling, 5s max duration)
### Vision Model (`notebooks/vision_sentiment_analysis.ipynb`)
- ResNet-50 fine-tuning on FER2013 dataset
- Emotion-to-sentiment mapping (happy/surprise → positive, angry/disgust/fear/sad → negative, neutral → neutral)
- Image preprocessing pipeline (face detection, grayscale conversion, 224x224 resize)
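Both mappings reduce to simple lookup tables; a sketch mirroring the bullets above (the dict names are illustrative):
```python
# Assumed lookup tables mirroring the emotion-to-sentiment mappings above
AUDIO_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprised": "positive",
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
}
VISION_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprise": "positive",
    "angry": "negative", "disgust": "negative", "fear": "negative", "sad": "negative",
    "neutral": "neutral",
}
```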
## Technical Implementation
### Model Management
- `SimpleModelManager` class handles model downloading from Google Drive
- Automatic model caching and version management
- Environment variable configuration for model URLs
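A minimal sketch of the download-and-cache pattern using gdown; the function name and cache layout are illustrative, not the actual `SimpleModelManager` internals:
```python
import os
import gdown

def download_model(drive_id: str, filename: str, cache_dir: str = "model_weights") -> str:
    """Fetch a .pth file from Google Drive once, then reuse the cached copy."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):  # simple cache: skip the download if already present
        gdown.download(id=drive_id, output=path, quiet=False)
    return path
```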
### Preprocessing Pipelines
- **Audio**: Automatic resampling, duration limiting, feature extraction
- **Vision**: Face detection, cropping, grayscale conversion, normalization
- **Text**: Direct TextBlob processing
### Streamlit Integration
- Multi-page application with sidebar navigation
- File upload widgets with format validation
- Real-time camera and microphone input
- Custom CSS styling for modern UI
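A minimal sketch of the sidebar navigation pattern; the page names and widgets are illustrative, not the app's exact layout:
```python
import streamlit as st
from textblob import TextBlob

PAGES = ["🏠 Home", "📝 Text Sentiment", "🎵 Audio Sentiment",
         "🖼️ Vision Sentiment", "🔗 Fused Model", "🎬 Max Fusion"]

page = st.sidebar.radio("Navigation", PAGES)

if page == "📝 Text Sentiment":
    text = st.text_area("Enter text to analyze")
    if st.button("Analyze") and text:
        polarity = TextBlob(text).sentiment.polarity
        label = "positive" if polarity > 0.1 else "negative" if polarity < -0.1 else "neutral"
        st.metric("Sentiment", label, delta=f"{polarity:+.2f}")
```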
## Deployment
### Docker Deployment
```bash
# Build the container
docker build -t sentiment-fused .
# Run the container
docker run -p 7860:7860 sentiment-fused
```
The application will be available at `http://localhost:7860`
### Local Development
```bash
# Run with custom port
streamlit run app.py --server.port 8502
# Run with custom address
streamlit run app.py --server.address 0.0.0.0
```
## Troubleshooting
### Common Issues
1. **Model Loading Errors**:
- Ensure environment variables are set correctly
- Check internet connection for model downloads
- Verify sufficient RAM (4GB+ recommended)
2. **Dependency Issues**:
- Use virtual environment to avoid conflicts
- Install PyTorch with CUDA support if using GPU
- Ensure OpenCV is properly installed for face detection
3. **Performance Issues**:
- Large audio/image files may cause memory issues
- Consider file size limits for better performance
- GPU acceleration available for PyTorch models
### Model Testing
```bash
# Test vision model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Vision model:', m.load_vision_model()[0] is not None)"
# Test audio model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Audio model:', m.load_audio_model()[0] is not None)"
```
## Dependencies
Key libraries used:
- **Streamlit**: Web application framework
- **PyTorch**: Deep learning framework
- **Transformers**: Hugging Face model library
- **OpenCV**: Computer vision and face detection
- **Librosa**: Audio processing
- **TextBlob**: Natural language processing
- **Gdown**: Google Drive file downloader
- **MoviePy**: Video processing and audio extraction
- **SpeechRecognition**: Audio transcription
## What This Project Demonstrates
1. **Multimodal AI Integration**: Combining text, audio, and vision models
2. **Model Management**: Automated downloading and caching of pre-trained models
3. **Real-time Processing**: Live audio recording and camera capture
4. **Smart Preprocessing**: Automatic format conversion and optimization
5. **Modern Web UI**: Professional Streamlit application with custom styling
6. **Production Ready**: Docker containerization and deployment
7. **Video Analysis**: Comprehensive video processing with multi-modal extraction
8. **Speech Recognition**: Audio-to-text transcription for enhanced analysis
9. **Modular Architecture**: Clean, maintainable code structure with separated concerns
10. **Professional Code Organization**: Proper Python packaging with config, models, utils, and UI modules
## Recent Improvements
The project has been refactored from a monolithic structure to a clean, modular architecture:
- **Modular Design**: Separated into logical modules (`src/config/`, `src/models/`, `src/utils/`, `src/ui/`)
- **Centralized Configuration**: All settings consolidated in `src/config/settings.py`
- **Clean Separation**: Model logic, preprocessing, and UI components are now in dedicated modules
- **Better Maintainability**: Easier to modify, test, and extend individual components
- **Professional Structure**: Follows Python packaging best practices
This project serves as a comprehensive example of building production-ready multimodal AI applications with modern Python tools and frameworks.