title: Multimodal Sentiment Analysis
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.48.1
app_file: app.py
pinned: false
Multimodal Sentiment Analysis
A comprehensive Streamlit application that combines three sentiment analysis models: text, audio, and vision. The project demonstrates how to integrate multiple AI models for comprehensive sentiment understanding across different modalities.
What is it?
This project implements a fused sentiment analysis system that combines predictions from three independent models:
1. Text Sentiment Analysis
- Model: TextBlob NLP library
- Capability: Analyzes text input for positive, negative, or neutral sentiment
- Status: ✅ Fully integrated and ready to use
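As a rough illustration of what the TextBlob-based page does, the snippet below maps polarity scores to the three sentiment labels. The 0.1 thresholds are an assumption for the sketch, not necessarily the values used in app.py.

```python
from textblob import TextBlob

def text_sentiment(text: str):
    """Map TextBlob polarity (-1 to 1) to a sentiment label; thresholds are illustrative."""
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return "positive", polarity
    if polarity < -0.1:
        return "negative", polarity
    return "neutral", polarity

print(text_sentiment("I really enjoyed this demo!"))  # ('positive', ...)
```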
2. Audio Sentiment Analysis
- Model: Fine-tuned Wav2Vec2-base model
- Training Data: RAVDESS + CREMA-D emotional speech datasets
- Capability: Analyzes audio files and microphone recordings for sentiment
- Features:
- File upload support (WAV, MP3, M4A, FLAC)
- Direct microphone recording (max 5 seconds)
- Automatic preprocessing (16kHz sampling, 5s max duration)
- Status: ✅ Fully integrated and ready to use
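A minimal sketch of the audio path described above (16 kHz resampling, 5-second cap, Wav2Vec2 classification). The base checkpoint, the 3-class head, and the label order are assumptions; the app loads its own fine-tuned weights (e.g. wav2vec2_model.pth) instead.

```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Preprocess as described above: mono, 16 kHz, at most 5 seconds.
waveform, sr = librosa.load("clip.wav", sr=16000, mono=True)
waveform = waveform[: 5 * 16000]

# Placeholder checkpoint and freshly initialized 3-class head; the real app would
# load its fine-tuned weights here instead.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base", num_labels=3)
model.eval()

inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Label order is an assumption for the sketch.
print(["negative", "neutral", "positive"][int(logits.argmax(dim=-1))])
```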
3. Vision Sentiment Analysis
- Model: Fine-tuned ResNet-50 model
- Training Data: FER2013 facial expression dataset
- Capability: Analyzes images for facial expression-based sentiment
- Features:
- File upload support (PNG, JPG, JPEG, BMP, TIFF)
- Camera capture functionality
- Automatic face detection and preprocessing
- Grayscale conversion and 224x224 resize
- Status: ✅ Fully integrated and ready to use
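The sketch below shows one way to implement the face detection, grayscale conversion, and 224x224 resize described above, feeding a 3-class ResNet-50. The Haar cascade, channel replication, and label order are illustrative assumptions rather than the app's exact code.

```python
import cv2
import torch
from torchvision import models, transforms

img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
x, y, w, h = faces[0]                                  # assume at least one face was detected
face = cv2.resize(gray[y:y + h, x:x + w], (224, 224))

tensor = torch.from_numpy(face).float().div(255.0)     # 224x224 grayscale in [0, 1]
tensor = tensor.unsqueeze(0).repeat(3, 1, 1)           # replicate channel for ResNet input (an assumption)
tensor = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])(tensor)

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 3)    # 3 sentiment classes
# model.load_state_dict(torch.load("model_weights/resnet50_model.pth"))  # fine-tuned weights
model.eval()
with torch.no_grad():
    logits = model(tensor.unsqueeze(0))
print(["negative", "neutral", "positive"][int(logits.argmax(dim=1))])
```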
4. Fused Model
- Approach: Combines predictions from all three models
- Capability: Provides comprehensive sentiment analysis across modalities
- Status: ✅ Fully integrated and ready to use
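The README does not spell out the fusion rule, so the sketch below uses a simple (optionally weighted) average of per-class probabilities as one plausible way to combine the three models' outputs.

```python
import numpy as np

def fuse_predictions(*probs: np.ndarray, weights=None):
    """Average per-class probabilities from the text, audio, and vision models.
    The equal-weight average is an illustrative fusion rule, not necessarily the one in app.py."""
    labels = ["negative", "neutral", "positive"]
    stacked = np.stack(probs)
    weights = np.ones(len(stacked)) / len(stacked) if weights is None else np.asarray(weights)
    fused = (weights[:, None] * stacked).sum(axis=0)
    return labels[int(fused.argmax())], float(fused.max())

# Example: probability vectors (negative, neutral, positive) from the three models.
print(fuse_predictions(np.array([0.1, 0.2, 0.7]),
                       np.array([0.2, 0.5, 0.3]),
                       np.array([0.1, 0.3, 0.6])))
```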
5. 🎬 Max Fusion
- Approach: Video-based comprehensive sentiment analysis
- Capability: Analyzes 5-second videos by extracting frames, extracting audio, and transcribing speech
- Features:
- Video recording or file upload (MP4, AVI, MOV, MKV, WMV, FLV)
- Automatic frame extraction for vision analysis
- Audio extraction for vocal sentiment analysis
- Speech-to-text transcription for text sentiment analysis
- Combined results from all three modalities
- Status: ✅ Fully integrated and ready to use
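A minimal sketch of the extraction step, assuming MoviePy and SpeechRecognition (both listed under Dependencies); file names, the sampled frame time, and the moviepy 1.x import path are placeholders.

```python
import speech_recognition as sr
from moviepy.editor import VideoFileClip   # moviepy 1.x import path

clip = VideoFileClip("demo.mp4")
frame = clip.get_frame(clip.duration / 2)            # one RGB frame for the vision model
clip.audio.write_audiofile("demo_audio.wav")         # soundtrack for the audio model

recognizer = sr.Recognizer()
with sr.AudioFile("demo_audio.wav") as source:
    transcript = recognizer.recognize_google(recognizer.record(source))  # text model input
print(transcript)
```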
Project Structure
sentiment-fused/
├── app.py                               # Main Streamlit application
├── simple_model_manager.py              # Model management and Google Drive integration
├── requirements.txt                     # Python dependencies
├── pyproject.toml                       # Project configuration
├── Dockerfile                           # Container deployment
├── notebooks/                           # Development notebooks
│   ├── audio_sentiment_analysis.ipynb   # Audio model development
│   └── vision_sentiment_analysis.ipynb  # Vision model development
├── model_weights/                       # Model storage directory (downloaded .pth files)
└── src/                                 # Source code package
    ├── __init__.py                      # Package initialization
    ├── config/                          # Configuration settings
    ├── models/                          # Model logic and inference code
    ├── utils/                           # Utility functions and preprocessing
    └── ui/                              # User interface components
Key Features
- Real-time Analysis: Instant sentiment predictions with confidence scores
- Smart Preprocessing: Automatic file format handling and preprocessing
- Multi-Page Interface: Clean navigation between different sentiment analysis modes
- Model Management: Automatic model downloading from Google Drive
- File Support: Multiple audio and image format support
- Camera & Microphone: Direct input capture capabilities
Prerequisites
- Python 3.9 or higher
- 4GB+ RAM (for model loading)
- Internet connection (for initial model download)
Installation
Clone the repository:
git clone <your-repo-url>
cd sentiment-fused
Create a virtual environment (recommended):
python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Set up environment variables: Create a .env file in the project root with:
VISION_MODEL_DRIVE_ID=your_google_drive_vision_model_file_id_here
AUDIO_MODEL_DRIVE_ID=your_google_drive_audio_model_file_id_here
VISION_MODEL_FILENAME=resnet50_model.pth
AUDIO_MODEL_FILENAME=wav2vec2_model.pth
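One way to pick these values up at runtime is python-dotenv, shown below; whether the app reads .env this way or relies on the shell environment is an assumption.

```python
import os
from dotenv import load_dotenv   # python-dotenv; an assumption about how .env is read

load_dotenv()                    # loads .env from the project root into the environment
vision_id = os.environ["VISION_MODEL_DRIVE_ID"]
audio_id = os.environ["AUDIO_MODEL_DRIVE_ID"]
print(vision_id, audio_id)
```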
Running Locally
Start the Streamlit application:
streamlit run app.py
Open your browser and navigate to the URL shown in the terminal (usually http://localhost:8501).
Navigate between pages using the sidebar:
- 🏠 Home: Overview and welcome page
- 📝 Text Sentiment: Analyze text with TextBlob
- 🎵 Audio Sentiment: Analyze audio files or record with microphone
- 🖼️ Vision Sentiment: Analyze images or capture with camera
- 🔗 Fused Model: Combine all three models
- 🎬 Max Fusion: Video-based comprehensive analysis
Model Development
The project includes Jupyter notebooks that document the development process:
Audio Model (notebooks/audio_sentiment_analysis.ipynb)
- Wav2Vec2-base fine-tuning on RAVDESS + CREMA-D datasets
- Emotion-to-sentiment mapping (happy/surprised → positive, sad/angry/fearful/disgust → negative, neutral/calm → neutral)
- Audio preprocessing pipeline (16kHz sampling, 5s max duration)
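Written out as a lookup table, the mapping above looks like the snippet below; the dict name is illustrative, not taken from the notebook.

```python
# Restates the emotion-to-sentiment mapping described above.
EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprised": "positive",
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
}
print(EMOTION_TO_SENTIMENT["calm"])   # neutral
```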
Vision Model (notebooks/vision_sentiment_analysis.ipynb)
- ResNet-50 fine-tuning on FER2013 dataset
- Emotion-to-sentiment mapping (happy/surprise → positive, angry/disgust/fear/sad → negative, neutral → neutral)
- Image preprocessing pipeline (face detection, grayscale conversion, 224x224 resize)
Technical Implementation
Model Management
- SimpleModelManager class handles model downloading from Google Drive
- Automatic model caching and version management
- Environment variable configuration for model URLs
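The snippet below sketches the download-and-cache pattern with gdown; the class and method names are hypothetical stand-ins for SimpleModelManager, not its actual interface.

```python
import os
import gdown

class DriveModelDownloader:
    """Hypothetical stand-in for SimpleModelManager's download/caching logic."""

    def __init__(self, cache_dir: str = "model_weights"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def fetch(self, drive_id: str, filename: str) -> str:
        """Download a weights file from Google Drive once, then reuse the cached copy."""
        path = os.path.join(self.cache_dir, filename)
        if not os.path.exists(path):
            gdown.download(id=drive_id, output=path, quiet=False)
        return path

downloader = DriveModelDownloader()
vision_path = downloader.fetch(os.environ["VISION_MODEL_DRIVE_ID"], "resnet50_model.pth")
```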
Preprocessing Pipelines
- Audio: Automatic resampling, duration limiting, feature extraction
- Vision: Face detection, cropping, grayscale conversion, normalization
- Text: Direct TextBlob processing
Streamlit Integration
- Multi-page application with sidebar navigation
- File upload widgets with format validation
- Real-time camera and microphone input
- Custom CSS styling for modern UI
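A minimal sketch of that layout using standard Streamlit widgets; the page names mirror the app, but the structure is illustrative rather than copied from app.py.

```python
import streamlit as st

page = st.sidebar.radio("Navigate", ["Home", "Text Sentiment", "Audio Sentiment", "Vision Sentiment"])

if page == "Text Sentiment":
    text = st.text_area("Enter text to analyze")
    if st.button("Analyze") and text:
        st.write("Sentiment: ...")                      # placeholder for the TextBlob call
elif page == "Audio Sentiment":
    audio_file = st.file_uploader("Upload audio", type=["wav", "mp3", "m4a", "flac"])
elif page == "Vision Sentiment":
    image = st.camera_input("Take a photo") or st.file_uploader("Upload image", type=["png", "jpg", "jpeg"])
```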
Deployment
Docker Deployment
# Build the container
docker build -t sentiment-fused .
# Run the container
docker run -p 7860:7860 sentiment-fused
The application will be available at http://localhost:7860
Local Development
# Run with custom port
streamlit run app.py --server.port 8502
# Run with custom address
streamlit run app.py --server.address 0.0.0.0
Troubleshooting
Common Issues
Model Loading Errors:
- Ensure environment variables are set correctly
- Check internet connection for model downloads
- Verify sufficient RAM (4GB+ recommended)
Dependency Issues:
- Use virtual environment to avoid conflicts
- Install PyTorch with CUDA support if using GPU
- Ensure OpenCV is properly installed for face detection
Performance Issues:
- Large audio/image files may cause memory issues
- Consider file size limits for better performance
- GPU acceleration available for PyTorch models
Model Testing
# Test vision model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Vision model:', m.load_vision_model()[0] is not None)"
# Test audio model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Audio model:', m.load_audio_model()[0] is not None)"
Dependencies
Key libraries used:
- Streamlit: Web application framework
- PyTorch: Deep learning framework
- Transformers: Hugging Face model library
- OpenCV: Computer vision and face detection
- Librosa: Audio processing
- TextBlob: Natural language processing
- Gdown: Google Drive file downloader
- MoviePy: Video processing and audio extraction
- SpeechRecognition: Audio transcription
What This Project Demonstrates
- Multimodal AI Integration: Combining text, audio, and vision models
- Model Management: Automated downloading and caching of pre-trained models
- Real-time Processing: Live audio recording and camera capture
- Smart Preprocessing: Automatic format conversion and optimization
- Modern Web UI: Professional Streamlit application with custom styling
- Production Ready: Docker containerization and deployment
- Video Analysis: Comprehensive video processing with multi-modal extraction
- Speech Recognition: Audio-to-text transcription for enhanced analysis
- Modular Architecture: Clean, maintainable code structure with separated concerns
- Professional Code Organization: Proper Python packaging with config, models, utils, and UI modules
Recent Improvements
The project has been refactored from a monolithic structure to a clean, modular architecture:
- Modular Design: Separated into logical modules (src/config/, src/models/, src/utils/, src/ui/)
- Centralized Configuration: All settings consolidated in src/config/settings.py (see the sketch after this list)
- Clean Separation: Model logic, preprocessing, and UI components are now in dedicated modules
- Better Maintainability: Easier to modify, test, and extend individual components
- Professional Structure: Follows Python packaging best practices
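A hypothetical sketch of what such a centralized settings module might hold; the constant names are illustrative and not taken from the repository.

```python
# src/config/settings.py -- hypothetical sketch; actual constant names may differ
import os

AUDIO_SAMPLE_RATE = 16_000
AUDIO_MAX_SECONDS = 5
IMAGE_SIZE = (224, 224)
MODEL_WEIGHTS_DIR = "model_weights"
VISION_MODEL_FILENAME = os.getenv("VISION_MODEL_FILENAME", "resnet50_model.pth")
AUDIO_MODEL_FILENAME = os.getenv("AUDIO_MODEL_FILENAME", "wav2vec2_model.pth")
```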
This project serves as a comprehensive example of building production-ready multimodal AI applications with modern Python tools and frameworks.