---
title: Reddit SemanticSearch Prototype
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
short_description: 'r/technology, r/gaming, r/programming etc search comments '
---

# Reddit Semantic Search (Prototype)

A lightweight semantic search engine built on Reddit comments using:

- **Word2Vec embeddings** (trained from scratch on selected subreddits)
- **FAISS** for fast vector indexing and retrieval
- **Gradio** for a user-friendly, Reddit-themed interface

> ⚠️ This is an independent prototype. Not affiliated with Reddit Inc.

---

## Dataset

- Source: [`HuggingFaceGECLM/REDDIT_comments`](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_comments)
- Subreddits used:
  - `askscience`, `gaming`, `technology`, `todayilearned`, `programming`
- Data was streamed using Hugging Face's `datasets` library and chunked using PySpark.

---

## Project Pipeline

1. **Data Loading & Chunking**
   - Load each subreddit split individually using streaming
   - Group every 5 comments into a single text chunk using PySpark
   - Clean and tokenize the text for training

2. **Training Word2Vec**
   - Custom embeddings are trained with `gensim`'s Word2Vec on the cleaned comment chunks

3. **Vector Indexing (FAISS)**
   - Each chunk is embedded by averaging the Word2Vec vectors of its words
   - The resulting dense vectors are indexed using `faiss.IndexFlatL2`

4. **Semantic Search App (Gradio)**
   - Enter a query and select a subreddit filter
   - The app retrieves the top 5 semantically similar comment chunks
   - Reranking logic can be added later

---

## Run the App

```bash
pip install -r requirements.txt
python app.py
# or run the notebook
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
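
---

## Pipeline Sketch

To make the chunking, averaging, and retrieval steps of the pipeline concrete, here is a minimal, dependency-free sketch. It is not the code in `app.py`: the helper names (`chunk_comments`, `tokenize`, `embed`, `search`) and the tiny `TOY_VECTORS` table are hypothetical stand-ins for the PySpark chunking job, the trained `gensim` Word2Vec model, and the FAISS index. The search loop mimics what a flat L2 index does, namely an exact brute-force comparison that ranks by squared Euclidean distance.

```python
import re
from math import fsum

# Hypothetical 2-dimensional word vectors standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "gpu": [1.0, 0.0],
    "game": [0.8, 0.2],
    "python": [0.0, 1.0],
    "code": [0.1, 0.9],
}


def chunk_comments(comments, size=5):
    """Join every `size` consecutive comments into one text chunk."""
    return [" ".join(comments[i:i + size]) for i in range(0, len(comments), size)]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, vectors, dim):
    """Average the vectors of all known words; return a zero vector if none are known."""
    known = [vectors[tok] for tok in tokenize(text) if tok in vectors]
    if not known:
        return [0.0] * dim
    return [fsum(column) / len(known) for column in zip(*known)]


def search(query, chunks, vectors, dim, k=5):
    """Return the k chunks closest to the query by squared L2 distance."""
    q = embed(query, vectors, dim)
    scored = []
    for idx, chunk in enumerate(chunks):
        # A real index would store precomputed chunk vectors instead of
        # re-embedding every chunk on each query.
        v = embed(chunk, vectors, dim)
        dist = fsum((a - b) ** 2 for a, b in zip(q, v))
        scored.append((dist, idx))
    scored.sort()
    return [(chunks[i], d) for d, i in scored[:k]]


comments = ["New GPU out", "Great game", "Python code review", "Clean code"]
chunks = chunk_comments(comments, size=2)  # the project groups 5; 2 keeps the demo small
results = search("python", chunks, TOY_VECTORS, dim=2, k=1)
print(results[0][0])
```

Averaging word vectors ignores word order, so two chunks with the same vocabulary receive the same embedding. That trade-off keeps indexing cheap and is one reason a reranking step may be worth adding on top of the retrieved candidates.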