---
title: multimodel-rag-chat-with-videos
app_file: app.py
sdk: gradio
sdk_version: 5.17.1
---


# Demo
## Sample Videos
- https://www.youtube.com/watch?v=kOEDG3j1bjs
- https://www.youtube.com/watch?v=7Hcg-rLYwdM
## Sample Questions
- Event Horizon
- Show me a group of astronauts / Astronaut name
# Re-Architecting the Multimodal RAG System Pipeline: A Journey
I ported the course locally and isolated each concept into its own runnable Python step.
The code is now simplified, refactored, and bug-fixed.
I also migrated from Prediction Guard to Hugging Face.

[**Interactive Video Chat Demo and Multimodal RAG System Architecture**](https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/2/interactive-demo-and-multimodal-rag-system-architecture)  

### A multimodal AI system should be able to understand both text and video content.  

## Setup
```bash
python -m venv venv
source venv/bin/activate
```
For Fish:
```bash
source venv/bin/activate.fish
```

## Step 1 - Learn Gradio (UI) (30 mins)  

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.  

### Key Concepts:  
- **fn**: The function wrapped by the UI.  
- **inputs**: The Gradio components used for input (should match function arguments).  
- **outputs**: The Gradio components used for output (should match return values).  

📖 [**Gradio Documentation**](https://www.gradio.app/docs/gradio/introduction)  

Gradio includes **30+ built-in components**.  

💡 **Tip**: For `inputs` and `outputs`, you can pass either:  
- The **component name** as a string (e.g., `"textbox"`)  
- An **instance of the component class** (e.g., `gr.Textbox()`)  
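
A minimal sketch putting these concepts together (the `greet` function and component choices are illustrative):

```python
import gradio as gr

def greet(name: str, intensity: int) -> str:
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,                      # the function wrapped by the UI
    inputs=["textbox", "slider"],  # one component per function argument
    outputs=gr.Textbox(),          # one component per return value
)

demo.launch()
```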

### Sharing Your Demo  
```python
demo.launch(share=True)  # Share your demo with just one extra parameter.
```

## Gradio Advanced Features  

### **Gradio.Blocks**  
Gradio provides `gr.Blocks`, a flexible way to design web apps with **custom layouts and complex interactions**:  
- Arrange components freely on the page.  
- Handle multiple data flows.  
- Use outputs as inputs for other components.  
- Dynamically update components based on user interaction.  
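
A minimal sketch of `gr.Blocks` (the reverse-text demo is illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():  # arrange components side by side
        inp = gr.Textbox(label="Input")
        out = gr.Textbox(label="Reversed")
    btn = gr.Button("Reverse")
    # one component's value flows into another on click
    btn.click(fn=lambda s: s[::-1], inputs=inp, outputs=out)

demo.launch()
```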

### **Gradio.ChatInterface**  
- Always set `type="messages"` in `gr.ChatInterface`.  
- The default (`type="tuples"`) is **deprecated** and will be removed in future versions.  
- For more UI flexibility, use `gr.Chatbot`.  
- `gr.ChatInterface` supports **Markdown** (not tested yet).  
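
A minimal sketch (the echo bot is illustrative):

```python
import gradio as gr

def echo(message, history):
    # with type="messages", `history` is a list of {"role", "content"} dicts
    return "You said: " + message

demo = gr.ChatInterface(fn=echo, type="messages")
demo.launch()
```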

---

## Step 2 - Learn the BridgeTower Embedding Model (Multimodal Learning) (15 mins)  

Developed in collaboration with Intel, BridgeTower maps image-caption pairs into **512-dimensional vectors**.  
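
A hedged sketch of producing one of these embeddings with Hugging Face `transformers` (the checkpoint name follows the public BridgeTower release; verify against the model card):

```python
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

image = Image.open("frame.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, text="a caption for this frame", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.cross_embeds  # joint image-text embedding
```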

### Measuring Similarity  
- **Cosine similarity** → Measures the angle between embedding vectors (**efficient & commonly used**).  
- **Euclidean distance** → Uses `cv2.NORM_L2` to compute the distance between two image embeddings (smaller = more similar).  
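
A quick sketch of both measures with NumPy (the vectors stand in for real BridgeTower embeddings):

```python
import numpy as np

a = np.random.rand(512)  # placeholder 512-D embeddings
b = np.random.rand(512)

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # higher = more similar
euclidean = np.linalg.norm(a - b)  # same value as cv2.norm(a, b, cv2.NORM_L2); lower = more similar
```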

### Converting to 2D for Visualization  
- **UMAP** reduces 512D embeddings to **2D for display purposes**.  
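
A sketch with the `umap-learn` package (parameters are illustrative):

```python
import numpy as np
import umap  # pip install umap-learn

embeddings = np.random.rand(100, 512)  # placeholder for N BridgeTower embeddings
reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings)  # (100, 2) points ready to plot
```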

## Step 3 - Preprocessing Videos for Multimodal RAG  

### **Case 1: WEBVTT → Extracting Text Segments from Video**  
- Converts video + text into structured metadata.  
- Splits the content into multiple segments.  
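
A sketch of reading such segments with the `webvtt-py` package (the path is a placeholder):

```python
import webvtt  # pip install webvtt-py

for caption in webvtt.read("subtitles.vtt"):
    print(caption.start, caption.end, caption.text)  # timestamped text segments
```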

### **Case 2: Whisper (Small) → Video Only**  
- Extracts the **audio** and transcribes it with `model.transcribe()`.  
- Applies the `getSubs()` helper function to produce **WEBVTT** subtitles.  
- Then uses **Case 1** processing.  
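
A sketch of the transcription step (the path is a placeholder):

```python
import whisper  # pip install git+https://github.com/openai/whisper.git

model = whisper.load_model("small")
result = model.transcribe("video.mp4")  # extracts and transcribes the audio track
print(result["text"])                   # result["segments"] carries the timestamps for WEBVTT
```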

### **Case 3: LVLM → Video That Is Silent or Music-Only**  
- Uses **LLaVA (an LVLM)** for **frame-based captioning**.  
- Encodes each frame as a **Base64 image**.  
- Extracts context and captions from the video frames.  
- Then uses **Case 1** processing.  
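
A sketch of the frame-grab and Base64 step (paths are placeholders):

```python
import base64
import cv2

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()  # grab one frame; seek/loop for more
cap.release()

if ok:
    _, buffer = cv2.imencode(".jpg", frame)
    frame_b64 = base64.b64encode(buffer).decode("utf-8")  # Base64 image for the LVLM
```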

## Step 4 - What is LLaVA?
LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads text embedded in them, and reasons about their context.
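
A hedged sketch of frame captioning with a LLaVA checkpoint via `transformers` (the model id and prompt template follow the llava-hf model card; verify before use):

```python
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = "USER: <image>\nDescribe this video frame.\nASSISTANT:"
out = pipe("frame.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(out[0]["generated_text"])
```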

## Step 5 - What is a Vector Store?
A vector store is a specialized database designed to:

- Store and manage high-dimensional vector data efficiently
- Perform similarity-based (k-nearest-neighbor) searches, where k=1 returns the single most similar result
- In LanceDB specifically, store multiple data types side by side:
    - Text content (captions)
    - Image file paths
    - Metadata
    - Vector embeddings


Ingest both videos' segments into a single LanceDB table:

```python
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,      # caption/transcript segments
    image_paths=vid1_img_path + vid2_img_path,  # extracted frame paths
    embedding=BridgeTowerEmbeddings(),          # multimodal embedding model
    metadatas=vid1_metadata + vid2_metadata,    # per-segment metadata
    connection=db,                              # LanceDB connection
    table_name=TBL_NAME,
    mode="overwrite",                           # replace the table if it already exists
)
```
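
A hedged sketch of querying the table afterwards (the constructor arguments and retriever interface are assumptions based on LangChain's `VectorStore` conventions; verify against the course helpers):

```python
vectorstore = MultimodalLanceDB(connection=db, embedding=BridgeTowerEmbeddings(), table_name=TBL_NAME)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 1})
docs = retriever.invoke("show me a group of astronauts")  # k=1 → the single best-matching segment
```
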
# Gotchas and Solutions
- **Image processing**: When working with Base64-encoded images, convert them to `PIL.Image` format before processing with BridgeTower.
- **Model selection**: Use `BridgeTowerForContrastiveLearning` instead of Prediction Guard due to API access limitations.
- **Model size**: The BridgeTower model requires a ~3.5 GB download.
- **Image downloads**: Some Flickr images may be unavailable; implement robust error handling.
- **Token decoding**: The BridgeTower contrastive-learning model works with embeddings, not token predictions.
- **Whisper install**: Install from `git+https://github.com/openai/whisper.git`.
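
A sketch of the first gotcha (`frame_b64` is assumed to be the Base64 string from the frame-encoding step above):

```python
import base64
from io import BytesIO
from PIL import Image

image = Image.open(BytesIO(base64.b64decode(frame_b64))).convert("RGB")
# `image` can now go straight into the BridgeTower processor
```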


# Install ffmpeg using brew

```bash
brew install ffmpeg
brew link ffmpeg
```

# Learning and Skills

## Technical Skills
- Basic machine learning and deep learning
- Vector embeddings and similarity search
- Multimodal data processing

## Framework & Library Expertise
- Hugging Face Transformers
- Gradio UI development
- LangChain integration (basic)
- PyTorch basics
- LanceDB vector storage

## AI/ML Concepts
- Multimodal RAG system architecture
- Vector embeddings and similarity search
- Large multimodal models (LLaVA)
- Image-text pair processing
- Dimensionality reduction techniques

## Multimedia Processing
- Video frame extraction
- Audio transcription (Whisper)
- Image processing (PIL)
- Base64 encoding/decoding
- WebVTT handling

## System Design
- Client-server architecture
- API endpoint design
- Data pipeline construction
- Vector store implementation
- Multimodal system integration