---
title: multimodel-rag-chat-with-videos
app_file: app.py
sdk: gradio
sdk_version: 5.17.1
---


# Demo
## Sample Videos
- https://www.youtube.com/watch?v=kOEDG3j1bjs
- https://www.youtube.com/watch?v=7Hcg-rLYwdM
## Sample Questions
- Event Horizon
- Show me a group of astronauts / Astronaut name
# Re-Architecting the Multimodal RAG System Pipeline: A Journey
I ported the course locally and isolated each concept into its own runnable Python step.
The code is now simplified, refactored, and bug-fixed.
I also migrated from Prediction Guard to Hugging Face.

[**Interactive Video Chat Demo and Multimodal RAG System Architecture**](https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/2/interactive-demo-and-multimodal-rag-system-architecture)  

### A multimodal AI system should be able to understand both text and video content.  

## Setup
```bash
python -m venv venv
source venv/bin/activate
```
For Fish:
```bash
source venv/bin/activate.fish
```

## Step 1 - Learn Gradio (UI) (30 mins)  

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.  

### Key Concepts:  
- **fn**: The function wrapped by the UI.  
- **inputs**: The Gradio components used for input (should match function arguments).  
- **outputs**: The Gradio components used for output (should match return values).  

📖 [**Gradio Documentation**](https://www.gradio.app/docs/gradio/introduction)  

Gradio includes **30+ built-in components**.  

💡 **Tip**: For `inputs` and `outputs`, you can pass either:  
- The **component name** as a string (e.g., `"textbox"`)  
- An **instance of the component class** (e.g., `gr.Textbox()`)  
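
A minimal sketch putting these concepts together (the `greet` function and component choices are illustrative):

```python
import gradio as gr

def greet(name: str, intensity: int) -> str:
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,                      # the function wrapped by the UI
    inputs=["textbox", "slider"],  # one component per function argument
    outputs=gr.Textbox(),          # one component per return value
)

demo.launch()
```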

### Sharing Your Demo  
```python
demo.launch(share=True)  # Share your demo with just one extra parameter.
```

## Gradio Advanced Features  

### **Gradio.Blocks**  
Gradio provides `gr.Blocks`, a flexible way to design web apps with **custom layouts and complex interactions**:  
- Arrange components freely on the page.  
- Handle multiple data flows.  
- Use outputs as inputs for other components.  
- Dynamically update components based on user interaction.  
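
A minimal sketch of `gr.Blocks` (the reverse-text demo is illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():  # arrange components side by side
        inp = gr.Textbox(label="Input")
        out = gr.Textbox(label="Reversed")
    btn = gr.Button("Reverse")
    # one component's value flows into another on click
    btn.click(fn=lambda s: s[::-1], inputs=inp, outputs=out)

demo.launch()
```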

### **Gradio.ChatInterface**  
- Always set `type="messages"` in `gr.ChatInterface`.  
- The default (`type="tuples"`) is **deprecated** and will be removed in future versions.  
- For more UI flexibility, use `gr.Chatbot`.  
- `gr.ChatInterface` supports **Markdown** (not tested yet).  
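
A minimal sketch (the echo bot is illustrative):

```python
import gradio as gr

def echo(message, history):
    # with type="messages", `history` is a list of {"role", "content"} dicts
    return "You said: " + message

demo = gr.ChatInterface(fn=echo, type="messages")
demo.launch()
```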

---

## Step 2 - Learn the BridgeTower Embedding Model (Multimodal Learning) (15 mins)  

Developed in collaboration with Intel, BridgeTower maps image-caption pairs into **512-dimensional vectors**.  
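
A hedged sketch of producing one of these embeddings with Hugging Face `transformers` (the checkpoint name follows the public BridgeTower release; verify against the model card):

```python
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

image = Image.open("frame.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, text="a caption for this frame", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.cross_embeds  # joint image-text embedding
```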

### Measuring Similarity  
- **Cosine similarity** → Measures the angle between embedding vectors (**efficient & commonly used**).  
- **Euclidean distance** → Uses `cv2.NORM_L2` to compute the distance between two image embeddings (smaller = more similar).  
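
A quick sketch of both measures with NumPy (the vectors stand in for real BridgeTower embeddings):

```python
import numpy as np

a = np.random.rand(512)  # placeholder 512-D embeddings
b = np.random.rand(512)

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # higher = more similar
euclidean = np.linalg.norm(a - b)  # same value as cv2.norm(a, b, cv2.NORM_L2); lower = more similar
```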

### Converting to 2D for Visualization  
- **UMAP** reduces 512D embeddings to **2D for display purposes**.  
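
A sketch with the `umap-learn` package (parameters are illustrative):

```python
import numpy as np
import umap  # pip install umap-learn

embeddings = np.random.rand(100, 512)  # placeholder for N BridgeTower embeddings
reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings)  # (100, 2) points ready to plot
```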

## Step 3 - Preprocessing Videos for Multimodal RAG  

### **Case 1: WEBVTT → Extracting Text Segments from Video**  
- Converts video + text into structured metadata.  
- Splits the content into multiple segments.  
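
A sketch of reading such segments with the `webvtt-py` package (the path is a placeholder):

```python
import webvtt  # pip install webvtt-py

for caption in webvtt.read("subtitles.vtt"):
    print(caption.start, caption.end, caption.text)  # timestamped text segments
```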

### **Case 2: Whisper (Small) → Video Only**  
- Extracts the **audio** and transcribes it with `model.transcribe()`.  
- Applies the `getSubs()` helper function to produce **WEBVTT** subtitles.  
- Then uses **Case 1** processing.  
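
A sketch of the transcription step (the path is a placeholder):

```python
import whisper  # pip install git+https://github.com/openai/whisper.git

model = whisper.load_model("small")
result = model.transcribe("video.mp4")  # extracts and transcribes the audio track
print(result["text"])                   # result["segments"] carries the timestamps for WEBVTT
```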

### **Case 3: LVLM → Video That Is Silent or Music-Only**  
- Uses **LLaVA (an LVLM)** for **frame-based captioning**.  
- Encodes each frame as a **Base64 image**.  
- Extracts context and captions from the video frames.  
- Then uses **Case 1** processing.  
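
A sketch of the frame-grab and Base64 step (paths are placeholders):

```python
import base64
import cv2

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()  # grab one frame; seek/loop for more
cap.release()

if ok:
    _, buffer = cv2.imencode(".jpg", frame)
    frame_b64 = base64.b64encode(buffer).decode("utf-8")  # Base64 image for the LVLM
```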

## Step 4 - What is LLaVA?
LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads text embedded in them, and reasons about their context.
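
A hedged sketch of frame captioning with a LLaVA checkpoint via `transformers` (the model id and prompt template follow the llava-hf model card; verify before use):

```python
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = "USER: <image>\nDescribe this video frame.\nASSISTANT:"
out = pipe("frame.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(out[0]["generated_text"])
```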

## Step 5 - What is a Vector Store?
A vector store is a specialized database designed to:

- Store and manage high-dimensional vector data efficiently
- Perform similarity-based (k-nearest-neighbor) searches, where k=1 returns the single most similar result
- In LanceDB specifically, store multiple data types side by side:
    - Text content (captions)
    - Image file paths
    - Metadata
    - Vector embeddings


Ingest both videos' segments into a single LanceDB table:

```python
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,      # caption/transcript segments
    image_paths=vid1_img_path + vid2_img_path,  # extracted frame paths
    embedding=BridgeTowerEmbeddings(),          # multimodal embedding model
    metadatas=vid1_metadata + vid2_metadata,    # per-segment metadata
    connection=db,                              # LanceDB connection
    table_name=TBL_NAME,
    mode="overwrite",                           # replace the table if it already exists
)
```
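
A hedged sketch of querying the table afterwards (the constructor arguments and retriever interface are assumptions based on LangChain's `VectorStore` conventions; verify against the course helpers):

```python
vectorstore = MultimodalLanceDB(connection=db, embedding=BridgeTowerEmbeddings(), table_name=TBL_NAME)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 1})
docs = retriever.invoke("show me a group of astronauts")  # k=1 → the single best-matching segment
```
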
# Gotchas and Solutions
- **Image processing**: When working with Base64-encoded images, convert them to `PIL.Image` format before processing with BridgeTower.
- **Model selection**: Use `BridgeTowerForContrastiveLearning` instead of Prediction Guard due to API access limitations.
- **Model size**: The BridgeTower model requires a ~3.5 GB download.
- **Image downloads**: Some Flickr images may be unavailable; implement robust error handling.
- **Token decoding**: The BridgeTower contrastive-learning model works with embeddings, not token predictions.
- **Whisper install**: Install from `git+https://github.com/openai/whisper.git`.
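
A sketch of the first gotcha (`frame_b64` is assumed to be the Base64 string from the frame-encoding step above):

```python
import base64
from io import BytesIO
from PIL import Image

image = Image.open(BytesIO(base64.b64decode(frame_b64))).convert("RGB")
# `image` can now go straight into the BridgeTower processor
```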


# Install ffmpeg using brew

```bash
brew install ffmpeg
brew link ffmpeg
```

# Learning and Skills

## Technical Skills
- Basic machine learning and deep learning
- Vector embeddings and similarity search
- Multimodal data processing

## Framework & Library Expertise
- Hugging Face Transformers
- Gradio UI development
- LangChain integration (basic)
- PyTorch basics
- LanceDB vector storage

## AI/ML Concepts
- Multimodal RAG system architecture
- Vector embeddings and similarity search
- Large multimodal models (LLaVA)
- Image-text pair processing
- Dimensionality reduction techniques

## Multimedia Processing
- Video frame extraction
- Audio transcription (Whisper)
- Image processing (PIL)
- Base64 encoding/decoding
- WebVTT handling

## System Design
- Client-server architecture
- API endpoint design
- Data pipeline construction
- Vector store implementation
- Multimodal system integration