---
title: multimodel-rag-chat-with-videos
app_file: app.py
sdk: gradio
sdk_version: 5.17.1
---
# Demo

## Sample Videos

- https://www.youtube.com/watch?v=kOEDG3j1bjs
- https://www.youtube.com/watch?v=7Hcg-rLYwdM

## Sample Questions

- Event Horizon
- Show me a group of astronauts; astronaut names
# Re-Architecting the Multimodal RAG System Pipeline: A Journey

I ported the course code locally and isolated each concept into a standalone, runnable Python step. The code is now simplified, refactored, and bug-fixed, and I migrated the model API calls from Prediction Guard to Hugging Face.

[**Interactive Video Chat Demo and Multimodal RAG System Architecture**](https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/2/interactive-demo-and-multimodal-rag-system-architecture)
### A multimodal AI system should be able to understand both text and video content.

## Setup

```bash
python -m venv venv
source venv/bin/activate
```

For Fish:

```bash
source venv/bin/activate.fish
```
## Step 1 - Learn Gradio (UI) (30 mins)

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.

### Key Concepts:

- **fn**: the function wrapped by the UI.
- **inputs**: the Gradio components used for input (should match the function's arguments).
- **outputs**: the Gradio components used for output (should match the function's return values).

📖 [**Gradio Documentation**](https://www.gradio.app/docs/gradio/introduction)

Gradio includes **30+ built-in components**.

💡 **Tip**: For `inputs` and `outputs`, you can pass either:

- The **component name** as a string (e.g., `"textbox"`)
- An **instance of the component class** (e.g., `gr.Textbox()`)
### Sharing Your Demo

```python
demo.launch(share=True)  # Share your demo with just one extra parameter.
```
## Gradio Advanced Features

### **Gradio.Blocks**

Gradio provides `gr.Blocks`, a flexible way to design web apps with **custom layouts and complex interactions**:

- Arrange components freely on the page.
- Handle multiple data flows.
- Use outputs as inputs for other components.
- Dynamically update components based on user interaction.
### **Gradio.ChatInterface**

- Always set `type="messages"` in `gr.ChatInterface`.
- The default (`type="tuples"`) is **deprecated** and will be removed in future versions.
- For more UI flexibility, use `gr.Chatbot`.
- `gr.ChatInterface` supports **Markdown** (not tested yet).

---
## Step 2 - Learn the Bridge Tower Embedding Model (Multimodal Learning) (15 mins)

Developed in collaboration with Intel, this model maps image-caption pairs into **512-dimensional vectors**.

### Measuring Similarity

- **Cosine Similarity** → measures how close two embeddings are in vector space (**efficient & commonly used**).
- **Euclidean Distance** → uses `cv2.NORM_L2` to compute the L2 distance between two images.
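Both metrics can be sketched in plain Python; the toy 3-D vectors below stand in for the real 512-D embeddings:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 distance, the same quantity cv2.NORM_L2 measures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v1, v2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v1))  # 1.0 -> identical direction
print(cosine_similarity(v1, v2))  # 0.0 -> orthogonal
print(euclidean_distance(v1, v2))
```

Cosine similarity only cares about direction, which is why it is the usual choice for comparing embeddings of different magnitudes.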
### Converting to 2D for Visualization

- **UMAP** reduces the 512-D embeddings to **2D for display purposes**.
## Step 3 - Preprocessing Videos for Multimodal RAG

### **Case 1: WEBVTT → Extracting Text Segments from Video**

- Converts video + text into structured metadata.
- Splits content into multiple segments.
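A minimal sketch of splitting WEBVTT cues into (timestamp, text) segments using only the standard library; the real pipeline uses the `webvtt` library and attaches richer metadata, and this regex assumes the full `HH:MM:SS.mmm` timestamp form:

```python
import re

def parse_webvtt(vtt: str) -> list:
    """Split a WEBVTT document into {'start', 'end', 'text'} segments."""
    cue_re = re.compile(
        r"(\d{2}:\d{2}:\d{2}\.\d{3})\s-->\s(\d{2}:\d{2}:\d{2}\.\d{3})\n(.+?)(?:\n\n|\Z)",
        re.DOTALL,
    )
    return [
        {"start": s, "end": e, "text": t.strip()}
        for s, e, t in cue_re.findall(vtt)
    ]

sample = """WEBVTT

00:00:00.000 --> 00:00:04.000
Welcome to the event horizon.

00:00:04.000 --> 00:00:08.000
A group of astronauts appears.
"""
segments = parse_webvtt(sample)
print(len(segments))  # 2
```

Each segment then becomes one text chunk paired with the video frame sampled at its midpoint.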
### **Case 2: Whisper (Small) → Video Only**

- Extracts **audio** → `model.transcribe()`.
- Applies the `getSubs()` helper function to produce **WEBVTT** subtitles.
- Then uses **Case 1** processing.
### **Case 3: LVLM → Video with Silent/Music-Only Audio**

- Uses **LLaVA (an LVLM model)** for **frame-based captioning**.
- Encodes each frame as a **Base64 image**.
- Extracts context and captions from video frames.
- Then uses **Case 1** processing.
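The frame-to-Base64 step can be sketched with the standard library; the dummy byte string here stands in for real JPEG frame data:

```python
import base64

def encode_frame(frame_bytes: bytes) -> str:
    """Encode raw image bytes as a Base64 string for the LVLM prompt."""
    return base64.b64encode(frame_bytes).decode("utf-8")

def decode_frame(b64: str) -> bytes:
    """Reverse the encoding when the raw image bytes are needed again."""
    return base64.b64decode(b64)

fake_jpeg = b"\xff\xd8\xff\xe0 fake frame data"  # stand-in for a real frame
b64 = encode_frame(fake_jpeg)
assert decode_frame(b64) == fake_jpeg  # lossless round trip
```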
## Step 4 - What is LLaVA?

LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads the text embedded in them, and reasons about their context.
## Step 5 - What is a Vector Store?

A vector store is a specialized database designed to:

- Store and manage high-dimensional vector data efficiently.
- Perform similarity-based searches, where K=1 returns the single most similar result.
- In LanceDB specifically, store multiple data types:
  - Text content (captions)
  - Image file paths
  - Metadata
  - Vector embeddings
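The K=1 similarity search can be illustrated with a toy in-memory store (brute-force cosine similarity over stored rows); LanceDB does the same thing at scale with approximate-nearest-neighbor indexes:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """Minimal stand-in for a vector store: rows of (embedding, payload)."""

    def __init__(self):
        self.rows = []  # each row: (embedding, {"caption": ..., "image_path": ...})

    def add(self, vector, payload):
        self.rows.append((vector, payload))

    def search(self, query, k=1):
        """Return the k payloads whose vectors are most similar to the query."""
        ranked = sorted(self.rows, key=lambda r: cosine(query, r[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], {"caption": "astronauts on the ISS", "image_path": "frame_001.jpg"})
store.add([0.0, 1.0], {"caption": "black hole event horizon", "image_path": "frame_042.jpg"})
print(store.search([0.9, 0.1], k=1))  # the astronauts row
```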
```python
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,
    image_paths=vid1_img_path + vid2_img_path,
    embedding=BridgeTowerEmbeddings(),
    metadatas=vid1_metadata + vid2_metadata,
    connection=db,
    table_name=TBL_NAME,
    mode="overwrite",
)
```
# Gotchas and Solutions

- **Image processing**: when working with Base64-encoded images, convert them to `PIL.Image` format before processing with BridgeTower.
- **Model selection**: use `BridgeTowerForContrastiveLearning` instead of Prediction Guard, due to API access limitations.
- **Model size**: the BridgeTower model requires a ~3.5 GB download.
- **Image downloads**: some Flickr images may be unavailable; implement robust error handling.
- **Token decoding**: the BridgeTower contrastive-learning model works with embeddings, not token predictions.
- **Whisper**: install from `git+https://github.com/openai/whisper.git`.
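The Base64-to-`PIL.Image` conversion mentioned in the first gotcha can be sketched like this, assuming Pillow is installed (the round trip below uses a tiny generated image rather than a real frame):

```python
import base64
import io

from PIL import Image

def b64_to_pil(b64_string: str) -> Image.Image:
    """Decode a Base64 image string into a PIL.Image for BridgeTower's processor."""
    return Image.open(io.BytesIO(base64.b64decode(b64_string))).convert("RGB")

# Round trip with a tiny in-memory image to show the conversion works.
buf = io.BytesIO()
Image.new("RGB", (2, 2), color="red").save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

img = b64_to_pil(b64)
print(img.size)  # (2, 2)
```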
# Install ffmpeg using brew

```bash
brew install ffmpeg
brew link ffmpeg
```
# Learning and Skills

## Technical Skills:

- Basic machine learning and deep learning
- Vector embeddings and similarity search
- Multimodal data processing

## Framework & Library Expertise:

- Hugging Face Transformers
- Gradio UI development
- LangChain integration (basic)
- PyTorch basics
- LanceDB vector storage

## AI/ML Concepts:

- Multimodal RAG system architecture
- Vector embeddings and similarity search
- Large multimodal models (LLaVA)
- Image-text pair processing
- Dimensionality reduction techniques

## Multimedia Processing:

- Video frame extraction
- Audio transcription (Whisper)
- Image processing (PIL)
- Base64 encoding/decoding
- WebVTT handling

## System Design:

- Client-server architecture
- API endpoint design
- Data pipeline construction
- Vector store implementation
- Multimodal system integration