metadata
title: Multimodal Sentiment Analysis
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.48.1
app_file: app.py
pinned: false

Multimodal Sentiment Analysis

A Streamlit application that combines three sentiment analysis models: text, audio, and vision. The project demonstrates how to integrate multiple AI models for sentiment understanding across different modalities.

Demo GIF

What is it?

This project implements a fused sentiment analysis system that combines predictions from three independent models:

1. Text Sentiment Analysis

  • Model: TextBlob NLP library
  • Capability: Analyzes text input for positive, negative, or neutral sentiment
  • Status: ✅ Fully integrated and ready to use
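
A minimal sketch of the text step using TextBlob's documented polarity score; the thresholds for mapping polarity to a label are illustrative assumptions, not the app's exact values:

from textblob import TextBlob

def text_sentiment(text: str) -> str:
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:      # threshold chosen for illustration
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

print(text_sentiment("I love this project!"))  # positive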

2. Audio Sentiment Analysis

  • Model: Fine-tuned Wav2Vec2-base model
  • Training Data: RAVDESS + CREMA-D emotional speech datasets
  • Capability: Analyzes audio files and microphone recordings for sentiment
  • Features:
    • File upload support (WAV, MP3, M4A, FLAC)
    • Direct microphone recording (max 5 seconds)
    • Automatic preprocessing (16kHz sampling, 5s max duration)
  • Status: ✅ Fully integrated and ready to use
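
A sketch of the preprocessing described above, assuming Librosa (listed under Dependencies) handles the resampling; the function name is illustrative:

import librosa

TARGET_SR = 16000   # Wav2Vec2 expects 16 kHz input
MAX_SECONDS = 5     # matches the model's 5-second training window

def preprocess_audio(path: str):
    # librosa resamples to the target rate on load
    waveform, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    # Truncate to the 5-second maximum
    return waveform[: MAX_SECONDS * TARGET_SR]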

3. Vision Sentiment Analysis

  • Model: Fine-tuned ResNet-50 model
  • Training Data: FER2013 facial expression dataset
  • Capability: Analyzes images for facial expression-based sentiment
  • Features:
    • File upload support (PNG, JPG, JPEG, BMP, TIFF)
    • Camera capture functionality
    • Automatic face detection and preprocessing
    • Grayscale conversion and 224x224 resize
  • Status: ✅ Fully integrated and ready to use
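
A sketch of that pipeline using OpenCV's bundled Haar cascade; the exact detector and normalization used by the app may differ:

import cv2

# OpenCV ships a pretrained Haar cascade for frontal faces
_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def preprocess_image(path: str):
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]          # crop to the first detected face
        gray = gray[y : y + h, x : x + w]
    # Resize to the 224x224 input ResNet-50 expects
    return cv2.resize(gray, (224, 224))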

4. Fused Model

  • Approach: Combines predictions from all three models
  • Capability: Provides comprehensive sentiment analysis across modalities
  • Status: ✅ Fully integrated and ready to use
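
This README does not specify the fusion rule; the sketch below assumes simple late fusion by averaging per-class probabilities from the three models, which is one common choice:

import numpy as np

LABELS = ["negative", "neutral", "positive"]

def fuse(text_probs, audio_probs, vision_probs):
    # Simple late fusion: average the three probability vectors
    stacked = np.stack([text_probs, audio_probs, vision_probs])
    mean = stacked.mean(axis=0)
    return LABELS[int(mean.argmax())], mean

label, probs = fuse([0.1, 0.2, 0.7], [0.2, 0.3, 0.5], [0.05, 0.15, 0.8])
print(label)  # positive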

5. 🎬 Max Fusion

  • Approach: Video-based comprehensive sentiment analysis
  • Capability: Analyzes 5-second videos by extracting frames, extracting audio, and transcribing speech
  • Features:
    • Video recording or file upload (MP4, AVI, MOV, MKV, WMV, FLV)
    • Automatic frame extraction for vision analysis
    • Audio extraction for vocal sentiment analysis
    • Speech-to-text transcription for text sentiment analysis
    • Combined results from all three modalities
  • Status: ✅ Fully integrated and ready to use
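
A sketch of the extraction step, assuming MoviePy 1.x and SpeechRecognition as listed under Dependencies; function and file names are illustrative:

import speech_recognition as sr
from moviepy.editor import VideoFileClip

def extract_modalities(video_path: str):
    clip = VideoFileClip(video_path)

    # Vision: sample one RGB frame per second as numpy arrays
    frames = [clip.get_frame(t) for t in range(int(clip.duration))]

    # Audio: write the soundtrack to a WAV file for the audio model
    clip.audio.write_audiofile("extracted.wav")

    # Text: transcribe the audio (Google Web Speech API, needs internet)
    recognizer = sr.Recognizer()
    with sr.AudioFile("extracted.wav") as source:
        transcript = recognizer.recognize_google(recognizer.record(source))

    return frames, "extracted.wav", transcript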

Project Structure

sentiment-fused/
├── app.py                          # Main Streamlit application
├── simple_model_manager.py         # Model management and Google Drive integration
├── requirements.txt                # Python dependencies
├── pyproject.toml                 # Project configuration
├── Dockerfile                     # Container deployment
├── notebooks/                     # Development notebooks
│   ├── audio_sentiment_analysis.ipynb    # Audio model development
│   └── vision_sentiment_analysis.ipynb   # Vision model development
├── model_weights/                 # Model storage directory (downloaded .pth files)
└── src/                           # Source code package
    ├── __init__.py               # Package initialization
    ├── config/                   # Configuration settings
    ├── models/                   # Model logic and inference code
    ├── utils/                    # Utility functions and preprocessing
    └── ui/                       # User interface components

Key Features

  • Real-time Analysis: Instant sentiment predictions with confidence scores
  • Smart Preprocessing: Automatic file format handling and preprocessing
  • Multi-Page Interface: Clean navigation between different sentiment analysis modes
  • Model Management: Automatic model downloading from Google Drive
  • File Support: Multiple audio and image format support
  • Camera & Microphone: Direct input capture capabilities

Prerequisites

  • Python 3.9 or higher
  • 4GB+ RAM (for model loading)
  • Internet connection (for initial model download)

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd sentiment-fused
    
  2. Create a virtual environment (recommended):

    python -m venv venv
    
    # On Windows
    venv\Scripts\activate
    
    # On macOS/Linux
    source venv/bin/activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Set up environment variables: Create a .env file in the project root with:

    VISION_MODEL_DRIVE_ID=your_google_drive_vision_model_file_id_here
    AUDIO_MODEL_DRIVE_ID=your_google_drive_audio_model_file_id_here
    VISION_MODEL_FILENAME=resnet50_model.pth
    AUDIO_MODEL_FILENAME=wav2vec2_model.pth
    
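Assuming python-dotenv is used to load the file (a common pattern, though this README does not confirm it), the app would read these values like so:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
vision_id = os.getenv("VISION_MODEL_DRIVE_ID")
audio_id = os.getenv("AUDIO_MODEL_DRIVE_ID")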

Running Locally

  1. Start the Streamlit application:

    streamlit run app.py
    
  2. Open your browser and navigate to the URL shown in the terminal (usually http://localhost:8501)

  3. Navigate between pages using the sidebar:

    • 🏠 Home: Overview and welcome page
    • 📝 Text Sentiment: Analyze text with TextBlob
    • 🎵 Audio Sentiment: Analyze audio files or record with microphone
    • 🖼️ Vision Sentiment: Analyze images or capture with camera
    • 🔗 Fused Model: Combine all three models
    • 🎬 Max Fusion: Video-based comprehensive analysis

Model Development

The project includes Jupyter notebooks that document the development process:

Audio Model (notebooks/audio_sentiment_analysis.ipynb)

  • Wav2Vec2-base fine-tuning on RAVDESS + CREMA-D datasets
  • Emotion-to-sentiment mapping (happy/surprised → positive, sad/angry/fearful/disgust → negative, neutral/calm → neutral)
  • Audio preprocessing pipeline (16kHz sampling, 5s max duration)

Vision Model (notebooks/vision_sentiment_analysis.ipynb)

  • ResNet-50 fine-tuning on FER2013 dataset
  • Emotion-to-sentiment mapping (happy/surprise → positive, angry/disgust/fear/sad → negative, neutral → neutral)
  • Image preprocessing pipeline (face detection, grayscale conversion, 224x224 resize)
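
The two mappings above, written out as plain dictionaries for reference:

# Audio model (RAVDESS + CREMA-D emotion labels)
AUDIO_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprised": "positive",
    "sad": "negative", "angry": "negative",
    "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
}

# Vision model (FER2013 emotion labels)
VISION_EMOTION_TO_SENTIMENT = {
    "happy": "positive", "surprise": "positive",
    "angry": "negative", "disgust": "negative",
    "fear": "negative", "sad": "negative",
    "neutral": "neutral",
}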

Technical Implementation

Model Management

  • SimpleModelManager class handles model downloading from Google Drive
  • Automatic model caching and version management
  • Environment variable configuration for model URLs
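
A sketch of the download-and-cache logic, assuming the Gdown library listed under Dependencies; the real SimpleModelManager may differ:

import os
import gdown

def ensure_model(drive_id: str, filename: str, cache_dir: str = "model_weights"):
    # Skip the download if the weights are already cached locally
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        gdown.download(id=drive_id, output=path, quiet=False)
    return path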

Preprocessing Pipelines

  • Audio: Automatic resampling, duration limiting, feature extraction
  • Vision: Face detection, cropping, grayscale conversion, normalization
  • Text: Direct TextBlob processing

Streamlit Integration

  • Multi-page application with sidebar navigation
  • File upload widgets with format validation
  • Real-time camera and microphone input
  • Custom CSS styling for modern UI
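
A minimal sketch of this pattern with standard Streamlit widgets; the page names mirror the navigation above, but the layout here is illustrative:

import streamlit as st

page = st.sidebar.radio(
    "Navigation",
    ["Home", "Text Sentiment", "Audio Sentiment", "Vision Sentiment"],
)

if page == "Audio Sentiment":
    # type= restricts uploads to the supported audio formats
    audio_file = st.file_uploader("Upload audio", type=["wav", "mp3", "m4a", "flac"])
elif page == "Vision Sentiment":
    image = st.camera_input("Capture an image")  # live camera input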

Deployment

Docker Deployment

# Build the container
docker build -t sentiment-fused .

# Run the container
docker run -p 7860:7860 sentiment-fused

The application will be available at http://localhost:7860

Local Development

# Run with custom port
streamlit run app.py --server.port 8502

# Run with custom address
streamlit run app.py --server.address 0.0.0.0

Troubleshooting

Common Issues

  1. Model Loading Errors:

    • Ensure environment variables are set correctly
    • Check internet connection for model downloads
    • Verify sufficient RAM (4GB+ recommended)
  2. Dependency Issues:

    • Use virtual environment to avoid conflicts
    • Install PyTorch with CUDA support if using GPU
    • Ensure OpenCV is properly installed for face detection
  3. Performance Issues:

    • Large audio/image files may cause memory issues
    • Consider file size limits for better performance
    • GPU acceleration is available for PyTorch models
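
To verify that PyTorch can see a GPU, a quick check in the same style as the model tests below:

# Check GPU availability
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"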

Model Testing

# Test vision model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Vision model:', m.load_vision_model()[0] is not None)"

# Test audio model
python -c "from simple_model_manager import SimpleModelManager; m = SimpleModelManager(); print('Audio model:', m.load_audio_model()[0] is not None)"

Dependencies

Key libraries used:

  • Streamlit: Web application framework
  • PyTorch: Deep learning framework
  • Transformers: Hugging Face model library
  • OpenCV: Computer vision and face detection
  • Librosa: Audio processing
  • TextBlob: Natural language processing
  • Gdown: Google Drive file downloader
  • MoviePy: Video processing and audio extraction
  • SpeechRecognition: Audio transcription

What This Project Demonstrates

  1. Multimodal AI Integration: Combining text, audio, and vision models
  2. Model Management: Automated downloading and caching of pre-trained models
  3. Real-time Processing: Live audio recording and camera capture
  4. Smart Preprocessing: Automatic format conversion and optimization
  5. Modern Web UI: Professional Streamlit application with custom styling
  6. Production Ready: Docker containerization and deployment
  7. Video Analysis: Comprehensive video processing with multi-modal extraction
  8. Speech Recognition: Audio-to-text transcription for enhanced analysis
  9. Modular Architecture: Clean, maintainable code structure with separated concerns
  10. Professional Code Organization: Proper Python packaging with config, models, utils, and UI modules

Recent Improvements

The project has been refactored from a monolithic structure to a clean, modular architecture:

  • Modular Design: Separated into logical modules (src/config/, src/models/, src/utils/, src/ui/)
  • Centralized Configuration: All settings consolidated in src/config/settings.py
  • Clean Separation: Model logic, preprocessing, and UI components are now in dedicated modules
  • Better Maintainability: Easier to modify, test, and extend individual components
  • Professional Structure: Follows Python packaging best practices

This project serves as a comprehensive example of building production-ready multimodal AI applications with modern Python tools and frameworks.