---
title: Intelligent Content Organizer MCP Agent
emoji: 💻
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: false
license: mit
tags:
- mcp-server-track
- agent-demo-track
---
A powerful Model Context Protocol (MCP) server for intelligent content management with semantic search, summarization, and Q&A capabilities powered by **OpenAI, Mistral AI, and Anthropic Claude**.
## [📹 Read Article](https://huggingface.co/blog/Nihal2000/intelligent-content-organizer#empowering-your-data-building-an-intelligent-content-organizer-with-mistral-ai-and-the-model-context-protocol)
## 🎯 Features
### 🔧 MCP Tools Available
- **📄 Document Ingestion**: Upload and process documents (PDF, TXT, DOCX, images with OCR)
- **🔍 Semantic Search**: Find relevant content using natural language queries
- **📝 Summarization**: Generate summaries in different styles (concise, detailed, bullet points, executive)
- **🏷️ Tag Generation**: Automatically generate relevant tags for content
- **❓ Q&A System**: Ask questions about your documents using RAG (Retrieval-Augmented Generation)
- **📂 Categorization**: Classify content into predefined or custom categories
- **📚 Batch Processing**: Process multiple documents at once
- **📊 Analytics**: Get insights and statistics about your content
### 🚀 Powered By
- **🧠 OpenAI GPT models** for powerful text generation and understanding
- **🔥 Mistral AI** for efficient text processing and analysis
- **🤖 Anthropic Claude** for advanced reasoning (available as a specific choice or fallback)
- **🔗 Sentence Transformers** for semantic embeddings
- **🔎 FAISS** for fast similarity search (see the sketch after this list)
- **👁️ Mistral OCR** for image text extraction
- **🎨 Gradio** for the user interface and MCP server functionality
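The retrieval stack pairs Sentence Transformers embeddings with a FAISS index. Here is a minimal sketch of that pairing, assuming an `all-MiniLM-L6-v2` embedding model and a flat inner-product index; the project's actual model choice, chunking, and index type may differ:

```python
# Hedged sketch: sentence-transformers embeddings + FAISS similarity search.
# The embedding model and index type are assumptions, not the project's code.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

chunks = [
    "Quarterly revenue grew 12% on subscription sales.",
    "The OCR pipeline extracts text from PNG and JPEG scans.",
]
# normalize_embeddings=True makes inner product equivalent to cosine similarity.
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(int(embeddings.shape[1]))
index.add(embeddings)

query = model.encode(["how did revenue change?"], normalize_embeddings=True)
scores, ids = index.search(query, k=1)
print(chunks[ids[0][0]], scores[0][0])  # best-matching chunk and its score
```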
## LLM Strategy

When `model` is set to `"auto"`, the agent selects the best available LLM for generative tasks, prioritizing OpenAI, then Mistral, and finally Anthropic. Users can also pin a specific model family (e.g., `gpt-`, `mistral-`, `claude-`), as sketched below.
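A minimal sketch of that priority order, assuming keys arrive via the standard `OPENAI_API_KEY`, `MISTRAL_API_KEY`, and `ANTHROPIC_API_KEY` environment variables; the helper and the per-family default model names are illustrative, not the agent's actual internals:

```python
import os

# OpenAI -> Mistral -> Anthropic, matching the priority described above.
PROVIDER_PRIORITY = ["gpt-", "mistral-", "claude-"]

# Illustrative defaults per family; the agent may choose differently.
DEFAULT_MODEL = {
    "gpt-": "gpt-4o-mini",
    "mistral-": "mistral-large-latest",
    "claude-": "claude-3-5-sonnet-latest",
}

KEY_ENV = {
    "gpt-": "OPENAI_API_KEY",
    "mistral-": "MISTRAL_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
}

def resolve_model(requested: str = "auto") -> str:
    """Map 'auto' (or a family prefix) to a concrete, available model name."""
    if requested != "auto" and not requested.endswith("-"):
        return requested  # caller named an exact model
    families = PROVIDER_PRIORITY if requested == "auto" else [requested]
    for family in families:
        if os.getenv(KEY_ENV[family]):  # provider key is configured
            return DEFAULT_MODEL[family]
    raise RuntimeError(f"No configured LLM provider for request: {requested}")
```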
## 🎯 Key Features Implemented
1. **Full MCP Server**: Complete implementation with all tools exposed
2. **Multi-Modal Processing**: PDF, TXT, DOCX, and image processing with OCR
3. **Advanced Search**: Semantic search with FAISS, filtering, and multi-query support
4. **AI-Powered Features**: Summarization, tagging, categorization, Q&A with RAG
5. **Production Ready**: Error handling, logging, caching, rate limiting
6. **Gradio UI**: Beautiful web interface for testing and direct use
7. **OpenAI + Anthropic + Mistral**: Multi-provider LLM support with fallbacks
## 🎥 Demo Video
[📹 Watch the demo video](https://youtu.be/uBYIj_ntFRk)
*The demo shows the MCP server in action, demonstrating document ingestion, semantic search, and Q&A capabilities, utilizing the configured LLM providers.*
## Prerequisites
- Python 3.9+
- API keys for OpenAI and Mistral AI; an Anthropic API key additionally enables Claude as a specific choice or fallback
## 🔧 MCP Tools Reference

Tool parameters like `model` accept `"auto"` or a specific model family prefix such as `gpt-`, `mistral-`, or `claude-`.
- **ingest_document**
- Process and index a document for searching.
- **Parameters:**
- `file_path` (string): Path to the document file (e.g., an uploaded file path).
- `file_type` (string, optional): File type/extension (e.g., ".pdf", ".txt"). If not provided, it is inferred from `file_path`.
- **Returns:**
- `success` (boolean): Whether the operation succeeded.
- `document_id` (string): Unique identifier for the processed document.
- `chunks_created` (integer): Number of text chunks created.
- `message` (string): Human-readable result message.
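A hedged sketch of calling `ingest_document` from Python with the official `mcp` client SDK, assuming the Space exposes the conventional Gradio MCP endpoint (the URL below is illustrative) and that each tool returns its JSON payload as a single text content block:

```python
import asyncio
import json

from mcp import ClientSession
from mcp.client.sse import sse_client

# Illustrative URL; Gradio MCP servers conventionally serve /gradio_api/mcp/sse.
SERVER_URL = "https://your-space.hf.space/gradio_api/mcp/sse"

async def main() -> None:
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "ingest_document",
                {"file_path": "docs/report.pdf"},  # file_type inferred from extension
            )
            # Assumes the tool returns its JSON result as one text block.
            payload = json.loads(result.content[0].text)
            print(payload["document_id"], payload["chunks_created"])

asyncio.run(main())
```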
- **semantic_search**
- Search through indexed content using natural language.
- **Parameters:**
- `query` (string): Search query.
- `top_k` (integer, optional): Number of results to return (default: 5).
- `filters` (object, optional): Search filters (e.g., {"document_id": "some_id"}).
- **Returns:**
- `success` (boolean): Whether the search succeeded.
- `results` (array of objects): Array of search results, each with content and score.
- `total_results` (integer): Number of results found.
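Reusing a `ClientSession` like the one opened in the ingestion sketch above, a filtered search might look like this (the filter key mirrors the example in the parameter list):

```python
import json

from mcp import ClientSession

async def search(session: ClientSession, doc_id: str) -> list[dict]:
    """Top-3 search scoped to one document via the document_id filter."""
    result = await session.call_tool(
        "semantic_search",
        {
            "query": "what were the key findings?",
            "top_k": 3,
            "filters": {"document_id": doc_id},
        },
    )
    payload = json.loads(result.content[0].text)  # assumed text-block JSON
    return payload["results"]  # each result carries content and score
```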
- **summarize_content**
- Generate a summary of provided content.
- **Parameters:**
- `content` (string, optional): Text content to summarize.
- `document_id` (string, optional): ID of document to summarize. (Either content or document_id must be provided).
- `style` (string, optional): Summary style: "concise", "detailed", "bullet_points", "executive" (default: "concise").
- `model` (string, optional): Specific LLM to use (e.g., "gpt-4o-mini", "mistral-large-latest", "auto"). Default: "auto".
- **Returns:**
- `success` (boolean): Whether summarization succeeded.
- `summary` (string): Generated summary.
- `original_length` (integer): Character length of original content.
- `summary_length` (integer): Character length of summary.
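A sketch of requesting a bullet-point summary of a previously ingested document, again assuming a live `ClientSession` as in the ingestion sketch:

```python
import json

from mcp import ClientSession

async def summarize(session: ClientSession, doc_id: str) -> str:
    """Summarize an ingested document in bullet-point style via 'auto' model selection."""
    result = await session.call_tool(
        "summarize_content",
        {"document_id": doc_id, "style": "bullet_points", "model": "auto"},
    )
    payload = json.loads(result.content[0].text)  # assumed text-block JSON
    return payload["summary"]
```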
- **generate_tags**
- Generate relevant tags for content.
- **Parameters:**
- `content` (string, optional): Text content to tag.
- `document_id` (string, optional): ID of document to tag. (Either content or document_id must be provided).
- `max_tags` (integer, optional): Maximum number of tags (default: 5).
- `model` (string, optional): Specific LLM to use. Default: "auto".
- **Returns:**
- `success` (boolean): Whether tag generation succeeded.
- `tags` (array of strings): Array of generated tags.
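A sketch of tagging ad-hoc text rather than an ingested document, pinning the model family to Mistral:

```python
import json

from mcp import ClientSession

async def tag(session: ClientSession, text: str) -> list[str]:
    """Generate up to 8 tags for raw text using a Mistral model."""
    result = await session.call_tool(
        "generate_tags",
        {"content": text, "max_tags": 8, "model": "mistral-large-latest"},
    )
    return json.loads(result.content[0].text)["tags"]  # assumed text-block JSON
```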
- **answer_question**
- Answer questions using RAG over your indexed content.
- **Parameters:**
- `question` (string): Question to answer.
- `context_filter` (object, optional): Filters for context retrieval (e.g., {"document_id": "some_id"}).
- `model` (string, optional): Specific LLM to use. Default: "auto".
- **Returns:**
- `success` (boolean): Whether question answering succeeded.
- `answer` (string): Generated answer.
- `sources` (array of objects): Source document chunks used for context, each with document_id, chunk_id, and content.
- `confidence` (string, optional): Confidence level in the answer (LLM-dependent, might not always be present).
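A sketch of a scoped RAG query that also surfaces the supporting chunks; the field names follow the return spec above, and `confidence` is read defensively since it may be absent:

```python
import json

from mcp import ClientSession

async def ask(session: ClientSession, question: str, doc_id: str) -> None:
    """Answer a question against one document and print the cited sources."""
    result = await session.call_tool(
        "answer_question",
        {
            "question": question,
            "context_filter": {"document_id": doc_id},
            "model": "auto",
        },
    )
    payload = json.loads(result.content[0].text)  # assumed text-block JSON
    print(payload["answer"], payload.get("confidence", "n/a"))
    for src in payload["sources"]:
        print(f'- {src["document_id"]} / {src["chunk_id"]}')
```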
## 📈 Performance

- **Embedding Generation**: ~100-500 ms per document chunk
- **Search**: <50 ms for most queries
- **Summarization**: 1-5 s depending on content length
- **Memory Usage**: ~200-500 MB base + ~1 MB per 1,000 document chunks
- **Supported File Types**: PDF, TXT, DOCX, PNG, JPG, JPEG