---
license: mit
language:
- en
tags:
- sentence-transformer
- embeddings
- mental-health
- intent-classification
pipeline_tag: feature-extraction
base_model: sentence-transformers/all-MiniLM-L6-v2
---

# Intent Encoder (MindPadi)

The `intent_encoder` is a Sentence Transformer model used in the MindPadi mental health assistant to **encode user messages into dense embeddings**. These embeddings support intent classification, similarity search, and memory-recall workflows, and underpin the semantic understanding of user input across MindPadi features.

## 🧠 Model Overview

- **Architecture:** Sentence-BERT (`all-MiniLM-L6-v2` base)
- **Task:** Sentence embedding / semantic similarity
- **Purpose:** Embed user queries for intent classification, vector search, and memory retrieval
- **Size:** ~22.7M parameters (≈90 MB in fp32)
- **Files:**
  - `config.json`
  - `pytorch_model.bin` or `model.safetensors`
  - `tokenizer.json`, `vocab.txt`
  - `1_Pooling/`, `2_Normalize/` (Sentence-BERT components)

## 🧾 Intended Use

### ✔️ Primary Use Cases

- Semantic embedding of user inputs for intent recognition
- Matching new messages against known intent samples (`data/processed_intents.json`)
- Supporting vector similarity in MongoDB Atlas Search or ChromaDB
- Powering memory in LangGraph agentic workflows

### 🚫 Not Recommended For

- Direct intent classification (this model returns embeddings, not classes)
- Use outside of NLP (e.g., image or audio tasks)

## 🧪 Integration in MindPadi

- `app/chatbot/intent_classifier.py`: computes sentence embeddings with this model
- `app/chatbot/intent_router.py`: leverages vector similarity for intent matching
- `database/vector_search.py`: stores and queries embeddings in a MongoDB vector index
- `app/utils/embedding_search.py`: embeds utterances for real-time nearest-neighbor lookup

## 🏋️ Training Details

- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (pretrained)
- **Fine-tuning:** Optional domain-specific contrastive learning using pairs in `training/datasets/fallback_pairs.json`
- **Script:** `training/fine_tune_encoder.py` (if fine-tuned)
- **Tokenizer:** BERT-based WordPiece tokenizer
- **Max Token Length:** 128

## 📈 Evaluation

This model is not evaluated with classification metrics; instead, its **embedding quality** was assessed through:

- **Cosine similarity tests** (intent embedding similarity)
- **Intent clustering accuracy** with `KMeans` in vector space
- **Recall@K** for correct intent retrieval
- **Visualizations:** UMAP plots (`logs/intent_umap.png`)

Results indicate:

- High-quality clustering of semantically similar intents
- ~91% Top-3 Recall for known intents

## 💬 Example Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mindpadi/intent_encoder")
texts = ["I want to talk to a therapist", "Book a session", "I'm feeling anxious"]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384)
```

## 🧪 Deployment (API Example)

```python
import requests

API_TOKEN = "hf_..."  # replace with your Hugging Face access token

endpoint = "https://api-inference.huggingface.co/models/mindpadi/intent_encoder"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {"inputs": "I need help managing stress"}

response = requests.post(endpoint, json=payload, headers=headers)
embedding = response.json()
```

## ⚠️ Limitations

* English-only
* Works best on short, clean sentences (not optimized for long documents)
* Does not return intent labels directly; must be paired with clustering or classification logic
* May yield ambiguous vectors for multi-intent or vague inputs

## 📜 License

MIT License – open for personal, academic, and commercial use with attribution.

## 📬 Contact

* **Project:** [MindPadi Mental Health Assistant](https://huggingface.co/mindpadi)
* **Team:** MindPadi Developers
* **Email:** [you@example.com](mailto:you@example.com)
* **GitHub:** [https://github.com/mindpadi](https://github.com/mindpadi)

*Last updated: May 2025*
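
## 🧩 Appendix: Similarity Matching Sketch

The similarity-based intent matching this card describes (embed a query, then retrieve the nearest known intent embeddings by cosine similarity) can be sketched with toy vectors. This is a minimal illustration, not MindPadi code: the `cosine_top_k` helper, the intent labels, and the 4-dimensional vectors are hypothetical stand-ins for the model's 384-dimensional embeddings.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, intent_matrix: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k intent embeddings most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = intent_matrix / np.linalg.norm(intent_matrix, axis=1, keepdims=True)
    sims = m @ q                   # cosine similarity of the query vs. every intent
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]

# Toy 4-d embeddings standing in for real 384-d model outputs.
intent_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],  # e.g. "book_session"
    [0.0, 1.0, 0.0, 0.0],  # e.g. "talk_to_therapist"
    [0.0, 0.0, 1.0, 0.0],  # e.g. "report_anxiety"
])
query = np.array([0.9, 0.1, 0.0, 0.0])

idx, scores = cosine_top_k(query, intent_embeddings, k=2)
print(idx)  # [0 1] -- "book_session" is the closest intent
```

In production the toy vectors would be replaced by `model.encode(...)` outputs, and for large intent sets the brute-force matrix product would typically be delegated to a vector index (MongoDB Atlas Search or ChromaDB, as noted above).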