---
title: Reddit SemanticSearch Prototype
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
short_description: 'Search comments from r/technology, r/gaming, r/programming'
---

# Reddit Semantic Search (Prototype)

A lightweight semantic search engine built on Reddit comments using:

- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface

> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

---

## Dataset

- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.

---

## Project Pipeline

1. **Data Loading & Chunking**
   - Load subreddit splits individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize the text for training *(see the loading sketch below)*

2. **Training Word2Vec**
   - Custom embeddings trained with `gensim`'s Word2Vec on the cleaned comment chunks *(see the training sketch below)*

3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - Dense vectors are indexed with `faiss.IndexFlatL2` *(see the indexing sketch below)*

4. **Semantic Search App (Gradio)**
   - Enter a query and select a subreddit filter
   - The app retrieves the top 5 semantically similar comment chunks
   - Reranking logic can be added later *(see the app sketch below)*
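
A minimal sketch of step 1, assuming the dataset exposes one config per subreddit with a `train` split and a `body` text field (adjust if the actual schema differs). The project does the grouping with PySpark; plain Python is used here only to keep the example self-contained:

```python
from datasets import load_dataset
from gensim.utils import simple_preprocess

# Stream one subreddit at a time; "programming" and the "body" field are
# assumptions about the dataset layout, for illustration only.
stream = load_dataset(
    "HuggingFaceGECLM/REDDIT_comments", "programming",
    split="train", streaming=True,
)

def comment_chunks(rows, size=5, limit=10_000):
    """Group every `size` comments into a single text chunk."""
    buffer = []
    for i, row in enumerate(rows):
        if i >= limit:           # cap the stream for a quick prototype run
            break
        buffer.append(row["body"])
        if len(buffer) == size:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)

# Clean and tokenize each chunk for Word2Vec training.
tokenized_chunks = [simple_preprocess(chunk) for chunk in comment_chunks(stream)]
```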
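Step 2, training the embeddings with `gensim`, might look roughly like this; the hyperparameters are illustrative, not the values used in the project:

```python
from gensim.models import Word2Vec

# Train custom Word2Vec embeddings on the tokenized comment chunks.
# vector_size/window/min_count/epochs below are placeholder values.
w2v = Word2Vec(
    sentences=tokenized_chunks,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    epochs=5,
)
w2v.save("reddit_word2vec.model")  # file name is arbitrary
```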
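Step 3 averages the Word2Vec vectors of each chunk and indexes the result with `faiss.IndexFlatL2`; `tokenized_chunks` and `w2v` carry over from the sketches above:

```python
import numpy as np
import faiss

def embed_chunk(tokens, model):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size, dtype="float32")
    return np.mean(vecs, axis=0).astype("float32")

chunk_matrix = np.stack([embed_chunk(toks, w2v) for toks in tokenized_chunks])

index = faiss.IndexFlatL2(w2v.vector_size)  # exact L2 search over dense vectors
index.add(chunk_matrix)                     # shape: (n_chunks, vector_size)
faiss.write_index(index, "reddit_chunks.index")
```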
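Step 4 wires the same query path into a small Gradio interface. `chunk_meta` (a list of `(subreddit, chunk_text)` pairs kept alongside the index) is a hypothetical helper for the subreddit filter, not something defined in this README, and the real `app.py` may differ:

```python
import gradio as gr
from gensim.utils import simple_preprocess

TOP_K = 5
SUBREDDITS = ["all", "askscience", "gaming", "technology", "todayilearned", "programming"]

def search(query, subreddit):
    """Embed the query like a chunk and return the closest comment chunks."""
    q = embed_chunk(simple_preprocess(query), w2v).reshape(1, -1)
    distances, ids = index.search(q, TOP_K * 4)  # over-fetch, then filter
    results = []
    for dist, i in zip(distances[0], ids[0]):
        if i == -1:
            continue
        sub, text = chunk_meta[i]  # hypothetical metadata kept next to the index
        if subreddit != "all" and sub != subreddit:
            continue
        results.append(f"[r/{sub}] (L2={dist:.2f}) {text}")
        if len(results) == TOP_K:
            break
    return "\n\n".join(results) or "No matches found."

demo = gr.Interface(
    fn=search,
    inputs=[gr.Textbox(label="Query"),
            gr.Dropdown(SUBREDDITS, value="all", label="Subreddit filter")],
    outputs=gr.Textbox(label="Top matches"),
    title="Reddit Semantic Search (Prototype)",
)

if __name__ == "__main__":
    demo.launch()
```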

---

## Run the App

```bash
pip install -r requirements.txt
python app.py   # or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference