---
title: Reddit SemanticSearch Prototype
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
short_description: 'Search r/technology, r/gaming, r/programming etc. comments'
---
# Reddit Semantic Search (Prototype)
A lightweight semantic search engine built on Reddit comments using:
- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface
> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.
---
## Dataset
- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.
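
The loading and chunking step can be sketched in plain Python. This is a minimal illustration, not the project's PySpark code: the helper names (`stream_subreddit`, `chunk_comments`, `tokenize`) are hypothetical, and it assumes each split is named after its subreddit and that the comment text lives in a `body` field. The tokenizer regex is likewise an assumed cleaning rule.

```python
import re
from itertools import islice


def stream_subreddit(subreddit):
    """Return a lazily streamed split (assumes split name == subreddit name)."""
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceGECLM/REDDIT_comments", split=subreddit, streaming=True
    )


def chunk_comments(rows, size=5, text_field="body"):
    """Yield one text chunk per `size` consecutive comments.

    `text_field="body"` is an assumption about the dataset schema.
    A final, shorter group is still yielded.
    """
    texts = (row[text_field] for row in rows)
    while True:
        group = list(islice(texts, size))
        if not group:
            return
        yield " ".join(group)


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (assumed cleaning rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Toy rows standing in for a streamed split, e.g. stream_subreddit("programming").
toy_rows = [{"body": f"comment {i}"} for i in range(12)]
toy_chunks = list(chunk_comments(toy_rows))
token_lists = [tokenize(chunk) for chunk in toy_chunks]
```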
---
## Project Pipeline
1. **Data Loading & Chunking**
   - Load subreddit splits individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize text for training
2. **Training Word2Vec**
   - Custom embeddings trained using `gensim`'s Word2Vec on the cleaned comment chunks
3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - Dense vectors are indexed using `faiss.IndexFlatL2`
4. **Semantic Search App (Gradio)**
   - Enter a query and select a subreddit filter
   - Retrieves the top 5 semantically similar comment chunks
   - Reranking logic is not built in yet and can be added later
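
Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `app.py`: the helper names (`train_word2vec`, `embed`, `search_l2`, `build_faiss_index`), the Word2Vec hyperparameters, and the toy vectors are all hypothetical. `search_l2` is a plain-Python stand-in that ranks by the same squared L2 distance that `faiss.IndexFlatL2` reports, so the example runs without FAISS installed. `vectors` can be any token-to-vector mapping, such as the `wv` attribute of a trained gensim model.

```python
def train_word2vec(token_lists):
    """Train Word2Vec on tokenized chunks (hyperparameters are assumptions)."""
    from gensim.models import Word2Vec  # lazy import; needs `pip install gensim`

    model = Word2Vec(
        sentences=token_lists, vector_size=100, window=5, min_count=2, workers=4
    )
    return model.wv


def embed(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known tokens: fall back to the zero vector
    return [sum(column) / len(known) for column in zip(*known)]


def search_l2(query_vec, chunk_vecs, k=5):
    """Exact search: return the k smallest (squared L2 distance, index) pairs."""
    scored = sorted(
        (sum((q - c) ** 2 for q, c in zip(query_vec, vec)), i)
        for i, vec in enumerate(chunk_vecs)
    )
    return scored[:k]


def build_faiss_index(chunk_vecs):
    """The same exact search backed by FAISS (needs `faiss` and `numpy`)."""
    import faiss
    import numpy as np

    xb = np.asarray(chunk_vecs, dtype="float32")  # FAISS expects float32 rows
    index = faiss.IndexFlatL2(xb.shape[1])
    index.add(xb)
    return index  # query with: distances, ids = index.search(xq, k)


# Toy 2-dimensional vectors standing in for trained embeddings.
vectors = {
    "gpu": [1.0, 0.0],
    "driver": [0.8, 0.2],
    "python": [0.0, 1.0],
    "loop": [0.2, 0.8],
}
chunks = [["gpu", "driver", "crash"], ["python", "loop"], ["gpu", "python"]]
chunk_vecs = [embed(tokens, vectors, dim=2) for tokens in chunks]

query_vec = embed(["python", "loop", "bug"], vectors, dim=2)
results = search_l2(query_vec, chunk_vecs, k=2)
print([index for _, index in results])  # chunk 1 first, then chunk 2
```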
---
## Run the App
```bash
pip install -r requirements.txt
python app.py  # or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference