---
title: Reddit SemanticSearch Prototype
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
short_description: 'Search r/technology, r/gaming, r/programming etc. comments'
---

# Reddit Semantic Search (Prototype)

A lightweight semantic search engine built on Reddit comments using:
- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface

> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

---

## Dataset

- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
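
The loading and chunking steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each subreddit is exposed as its own split and that the comment text lives in a `body` field, and it groups comments with plain Python instead of PySpark to keep the example short. The function names are hypothetical.

```python
import re
from itertools import islice


def stream_comments(subreddit, limit=1000):
    """Yield up to `limit` comment bodies from one subreddit split (needs network)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the split name equals the subreddit name, and the text field is "body".
    stream = load_dataset(
        "HuggingFaceGECLM/REDDIT_comments", split=subreddit, streaming=True
    )
    for row in islice(stream, limit):
        yield row["body"]


def chunk_comments(comments, size=5):
    """Join every `size` consecutive comments into one text chunk."""
    iterator = iter(comments)
    while True:
        group = list(islice(iterator, size))
        if not group:
            return
        yield " ".join(group)


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())
```

For example, `chunk_comments(stream_comments("programming"))` would yield one chunk per five streamed comments, and `tokenize` turns each chunk into the token list used for training.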

---

## Project Pipeline

1. **Data Loading & Chunking**
   - Load subreddit splits individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize text for training

2. **Training Word2Vec**
   - Custom embeddings trained using `gensim`'s Word2Vec on cleaned comment chunks

3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - Dense vectors indexed using `faiss.IndexFlatL2`

4. **Semantic Search App (Gradio)**
   - Enter your query and select a subreddit filter
   - Retrieves top 5 semantically similar comment chunks
   - Reranking is not implemented yet and can be added later
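
Steps 2 to 4 can be sketched as below. This is an illustrative outline under stated assumptions, not the code in `app.py`: the function names and the hyperparameters (`vector_size=100`, `window=5`, `min_count=2`) are hypothetical, and the subreddit filter is left out. `gensim` and `faiss` are imported inside the functions that need them so the averaging helper can be used on its own.

```python
import numpy as np


def train_word2vec(tokenized_chunks, dim=100):
    """Train Word2Vec on lists of tokens and return the word vectors."""
    from gensim.models import Word2Vec

    # Hyperparameters here are placeholders, not the project's settings.
    model = Word2Vec(
        sentences=tokenized_chunks, vector_size=dim, window=5, min_count=2, workers=4
    )
    return model.wv


def embed_tokens(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


def build_index(chunk_vectors):
    """Build an exact L2 index over a (num_chunks, dim) array of embeddings."""
    import faiss

    # FAISS expects contiguous float32 input.
    matrix = np.ascontiguousarray(chunk_vectors, dtype=np.float32)
    index = faiss.IndexFlatL2(matrix.shape[1])
    index.add(matrix)
    return index


def search(index, query_vector, k=5):
    """Return (chunk_id, squared_L2_distance) pairs for the k nearest chunks."""
    distances, ids = index.search(query_vector.reshape(1, -1), k)
    return list(zip(ids[0].tolist(), distances[0].tolist()))
```

A query is embedded with the same `embed_tokens` function as the chunks, so both live in the same vector space; the returned ids map back to the chunk texts shown in the interface.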

---

## Run the App

```bash
pip install -r requirements.txt
python app.py  # or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference