File size: 6,428 Bytes
ad022d3 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 |
---
title: multimodel-rag-chat-with-videos
app_file: app.py
sdk: gradio
sdk_version: 5.17.1
---
# Demo
## Sample Video
- https://www.youtube.com/watch?v=kOEDG3j1bjs
- https://www.youtube.com/watch?v=7Hcg-rLYwdM
## Questions
- Event Horizon
- show me a group of astronauts, AStronaut name
# ReArchitecture Multimodal RAG System Pipeline Journey
I ported it locally and isolated each concept into a step as Python runnable
It is simplified, refactored and bug-fixed now.
I migrated from Prediction Guard to HuggingFace.
[**Interactive Video Chat Demo and Multimodal RAG System Architecture**](https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/2/interactive-demo-and-multimodal-rag-system-architecture)
### A multimodal AI system should be able to understand both text and video content.
## Setup
```bash
python -m venv venv
source venv/bin/activate
```
For Fish
```bash
source venv/bin/activate.fish
```
## Step 1 - Learn Gradio (UI) (30 mins)
Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.
### Key Concepts:
- **fn**: The function wrapped by the UI.
- **inputs**: The Gradio components used for input (should match function arguments).
- **outputs**: The Gradio components used for output (should match return values).
📖 [**Gradio Documentation**](https://www.gradio.app/docs/gradio/introduction)
Gradio includes **30+ built-in components**.
💡 **Tip**: For `inputs` and `outputs`, you can pass either:
- The **component name** as a string (e.g., `"textbox"`)
- An **instance of the component class** (e.g., `gr.Textbox()`)
### Sharing Your Demo
```python
demo.launch(share=True) # Share your demo with just one extra parameter.
```
## Gradio Advanced Features
### **Gradio.Blocks**
Gradio provides `gr.Blocks`, a flexible way to design web apps with **custom layouts and complex interactions**:
- Arrange components freely on the page.
- Handle multiple data flows.
- Use outputs as inputs for other components.
- Dynamically update components based on user interaction.
### **Gradio.ChatInterface**
- Always set `type="messages"` in `gr.ChatInterface`.
- The default (`type="tuples"`) is **deprecated** and will be removed in future versions.
- For more UI flexibility, use `gr.ChatBot`.
- `gr.ChatInterface` supports **Markdown** (not tested yet).
---
## Step 2 - Learn Bridge Tower Embedding Model (Multimodal Learning) (15 mins)
Developed in collaboration with Intel, this model maps image-caption pairs into **512-dimensional vectors**.
### Measuring Similarity
- **Cosine Similarity** → Measures how close images are in vector space (**efficient & commonly used**).
- **Euclidean Distance** → Uses `cv2.NORM_L2` to compute similarity between two images.
### Converting to 2D for Visualization
- **UMAP** reduces 512D embeddings to **2D for display purposes**.
## Preprocessing Videos for Multimodal RAG
### **Case 1: WEBVTT → Extracting Text Segments from Video**
- Converts video + text into structured metadata.
- Splits content inhttps://www.youtube.com/watch?v=kOEDG3j1bjsto multiple segments.
### **Case 2: Whisper (Small) → Video Only**
- Extracts **audio** → `model.transcribe()`.
- Applies `getSubs()` helper function to retrieve **WEBVTT** subtitles.
- Uses **Case 1** processing.
### **Case 3: LvLM → Video + Silent/Music Extraction**
- Uses **Llava (LvLM model)** for **frame-based captioning**.
- Encodes each frame as a **Base64 image**.
- Extracts context and captions from video frames.
- Uses **Case 1** processing.
# Step 4 - What is LLaVA?
LLaVA (Large Language-and-Vision Assistant), a large multimodal model that connects a vision encoder that doesn't just see images but understands them, reads the text embedded in them, and reasons about their context—all.
# Step 5 - what is a vector Store?
A vector store is a specialized database designed to:
- Store and manage high-dimensional vector data efficiently
- Perform similarity-based searches where K=1 returns the most similar result
- In LanceDB specifically, store multiple data types:
. Text content (captions)
. Image file paths
. Metadata
. Vector embeddings
```python
_ = MultimodalLanceDB.from_text_image_pairs(
texts=updated_vid1_trans+vid2_trans,
image_paths=vid1_img_path+vid2_img_path,
embedding=BridgeTowerEmbeddings(),
metadatas=vid1_metadata+vid2_metadata,
connection=db,
table_name=TBL_NAME,
mode="overwrite",
)
```
# Gotchas and Solutions
Image Processing: When working with base64 encoded images, convert them to PIL.Image format before processing with BridgeTower
Model Selection: Using BridgeTowerForContrastiveLearning instead of PredictionGuard due to API access limitations
Model Size: BridgeTower model requires ~3.5GB download
Image Downloads: Some Flickr images may be unavailable; implement robust error handling
Token Decoding: BridgeTower contrastive learning model works with embeddings, not token predictions
Install from git+https://github.com/openai/whisper.git
# Install ffmepg using brew
```bash
brew install ffmpeg
brew link ffmpeg
```
# Learning and Skills
## Technical Skills:
Basic Machine learning and deep learning
Vector embeddings and similarity search
Multimodal data processing
## Framework & Library Expertise:
Hugging Face Transformers
Gradio UI development
LangChain integration (Basic)
PyTorch basics
LanceDB vector storage
## AI/ML Concepts:
Multimodal RAG system architecture
Vector embeddings and similarity search
Large Language Models (LLaVA)
Image-text pair processing
Dimensionality reduction techniques
## Multimedia Processing:
Video frame extraction
Audio transcription (Whisper)
Image processing (PIL)
Base64 encoding/decoding
WebVTT handling
## System Design:
Client-server architecture
API endpoint design
Data pipeline construction
Vector store implementation
Multimodal system integration
|