---
title: Intelligent Content Organizer MCP Agent
emoji: 😻
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: false
license: mit
tags:
  - mcp-server-track
  - agent-demo-track
---

A powerful Model Context Protocol (MCP) server for intelligent content management with semantic search, summarization, and Q&A capabilities powered by **OpenAI, Mistral AI, and Anthropic Claude**.

## [πŸ“Ή Read Article](https://huggingface.co/blog/Nihal2000/intelligent-content-organizer#empowering-your-data-building-an-intelligent-content-organizer-with-mistral-ai-and-the-model-context-protocol)

## 🎯 Features

### πŸ”§ MCP Tools Available

- **πŸ“„ Document Ingestion**: Upload and process documents (PDF, TXT, DOCX, images with OCR)
- **πŸ” Semantic Search**: Find relevant content using natural language queries
- **πŸ“ Summarization**: Generate summaries in different styles (concise, detailed, bullet points, executive)
- **🏷️ Tag Generation**: Automatically generate relevant tags for content
- **❓ Q&A System**: Ask questions about your documents using RAG (Retrieval-Augmented Generation)
- **πŸ“Š Categorization**: Classify content into predefined or custom categories
- **πŸ”„ Batch Processing**: Process multiple documents at once
- **πŸ“ˆ Analytics**: Get insights and statistics about your content

### πŸš€ Powered By

- **🧠 OpenAI GPT models** for powerful text generation and understanding
- **πŸ”₯ Mistral AI** for efficient text processing and analysis
- **πŸ€– Anthropic Claude** for advanced reasoning (available as a specific choice or fallback)
- **πŸ”— Sentence Transformers** for semantic embeddings
- **πŸ“š FAISS** for fast similarity search
- **πŸ‘οΈ Mistral OCR** for image text extraction
- **🎨 Gradio** for the user interface and MCP server functionality
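
Under the hood, the semantic pieces compose in a familiar way: Sentence Transformers produces the embeddings and FAISS serves the similarity search. A minimal sketch of that pattern; the model name and sample chunks are illustrative assumptions, not this project's actual configuration:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; the project may configure a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Q3 revenue grew 12% year over year.",
    "The onboarding guide covers SSO setup.",
    "Invoices are archived for seven years.",
]

# Encode and L2-normalize so inner product behaves as cosine similarity.
vectors = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = model.encode(["how did revenue change?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```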

## 🧠 LLM Strategy

When "auto" model selection is used, the agent picks the best available LLM for most generative tasks, prioritizing OpenAI, then Mistral, and finally Anthropic. Users can also specify a particular model family (e.g., "gpt-", "mistral-", "claude-").
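
A minimal sketch of that priority order, assuming hypothetical per-family default models and the conventional API-key environment variables (the project's real selection logic lives in its LLM service):

```python
import os

# Documented priority order: OpenAI, then Mistral, then Anthropic.
# Default model names per family are illustrative assumptions.
PROVIDERS = [
    ("gpt-", "OPENAI_API_KEY", "gpt-4o-mini"),
    ("mistral-", "MISTRAL_API_KEY", "mistral-large-latest"),
    ("claude-", "ANTHROPIC_API_KEY", "claude-3-5-sonnet-latest"),
]

def resolve_model(requested: str = "auto") -> str:
    """Map 'auto' or a family prefix like 'gpt-' to a concrete model name."""
    for prefix, env_key, default_model in PROVIDERS:
        if requested.startswith(prefix):
            # A bare family prefix falls back to that family's default.
            return default_model if requested == prefix else requested
        if requested == "auto" and os.getenv(env_key):
            # 'auto' picks the first provider with a configured key.
            return default_model
    raise RuntimeError("No LLM provider configured or model not recognized")

print(resolve_model("auto"))
print(resolve_model("mistral-"))  # family prefix -> family default
```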

## 🎯 Key Features Implemented

1. **Full MCP Server**: Complete implementation with all tools exposed
2. **Multi-Modal Processing**: PDF, TXT, DOCX, and image processing with OCR
3. **Advanced Search**: Semantic search with FAISS, filtering, and multi-query support
4. **AI-Powered Features**: Summarization, tagging, categorization, Q&A with RAG (sketched after this list)
5. **Production Ready**: Error handling, logging, caching, rate limiting
6. **Gradio UI**: Beautiful web interface for testing and direct use
7. **OpenAI + Anthropic + Mistral**: LLM support with fallbacks
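
The Q&A feature in item 4 follows the standard RAG shape: retrieve the chunks most similar to the question, then ground the LLM's answer in them. A compressed sketch of that flow, with the retriever and LLM as placeholder objects rather than the project's actual services:

```python
def answer_with_rag(question: str, retriever, llm, top_k: int = 5) -> str:
    """Sketch of the retrieve-then-generate loop behind the Q&A tool."""
    # 1. Retrieve the indexed chunks most similar to the question.
    chunks = retriever.search(question, top_k=top_k)
    # 2. Assemble the retrieved chunks into grounding context.
    context = "\n\n".join(chunk["content"] for chunk in chunks)
    # 3. Ask the LLM to answer strictly from that context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```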

## πŸŽ₯ Demo Video

[πŸ“Ή Watch the demo video](https://youtu.be/uBYIj_ntFRk)

*The demo shows the MCP server in action, demonstrating document ingestion, semantic search, and Q&A capabilities, utilizing the configured LLM providers.*

### Prerequisites

- Python 3.9+
- API keys for OpenAI and Mistral AI; an Anthropic API key enables Claude as a specific choice or fallback
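
A quick way to confirm the keys are visible before launching the app. The variable names below are the conventional ones for each provider's SDK; check the project's configuration for the exact keys it reads:

```python
import os

# Conventional environment variable names; the project's config may differ.
for key in ("OPENAI_API_KEY", "MISTRAL_API_KEY", "ANTHROPIC_API_KEY"):
    status = "set" if os.getenv(key) else "MISSING"
    print(f"{key}: {status}")
```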

## 🛠️ MCP Tools Reference

Tool parameters such as `model` accept `"auto"` or a specific model family prefix (`"gpt-"`, `"mistral-"`, `"claude-"`). A minimal client sketch follows the `ingest_document` entry below, and a combined sketch calling the remaining tools appears at the end of this reference.

- **ingest_document**
  - Process and index a document for searching.
  - **Parameters:**
    - `file_path` (string): Path to the document file (e.g., an uploaded file path).
    - `file_type` (string, optional): File type/extension (e.g., ".pdf", ".txt"). If not provided, it is inferred from `file_path`.
  - **Returns:**
    - `success` (boolean): Whether the operation succeeded.
    - `document_id` (string): Unique identifier for the processed document.
    - `chunks_created` (integer): Number of text chunks created.
    - `message` (string): Human-readable result message.
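
Since the Gradio app doubles as an MCP server, the tools can be exercised with the official `mcp` Python client. A minimal sketch for `ingest_document`; the endpoint URL and file path are assumptions (Gradio conventionally serves MCP over SSE at `/gradio_api/mcp/sse`):

```python
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

# Assumed local Gradio MCP endpoint; substitute your Space's URL.
SERVER_URL = "http://localhost:7860/gradio_api/mcp/sse"

async def ingest(path: str):
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # file_type is optional; it is inferred from the path if omitted.
            return await session.call_tool(
                "ingest_document",
                arguments={"file_path": path, "file_type": ".pdf"},
            )

result = asyncio.run(ingest("docs/report.pdf"))
print(result)
```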

- **semantic_search**
  - Search through indexed content using natural language.
  - **Parameters:**
    - `query` (string): Search query.
    - `top_k` (integer, optional): Number of results to return (default: 5).
    - `filters` (object, optional): Search filters (e.g., {"document_id": "some_id"}).
  - **Returns:**
    - `success` (boolean): Whether the search succeeded.
    - `results` (array of objects): Array of search results, each with content and score.
    - `total_results` (integer): Number of results found.

- **summarize_content**
  - Generate a summary of provided content.
  - **Parameters:**
    - `content` (string, optional): Text content to summarize.
    - `document_id` (string, optional): ID of document to summarize. (Either content or document_id must be provided).
    - `style` (string, optional): Summary style: "concise", "detailed", "bullet_points", "executive" (default: "concise").
    - `model` (string, optional): Specific LLM to use (e.g., "gpt-4o-mini", "mistral-large-latest", "auto"). Default: "auto".
  - **Returns:**
    - `success` (boolean): Whether summarization succeeded.
    - `summary` (string): Generated summary.
    - `original_length` (integer): Character length of original content.
    - `summary_length` (integer): Character length of summary.

- **generate_tags**
  - Generate relevant tags for content.
  - **Parameters:**
    - `content` (string, optional): Text content to tag.
    - `document_id` (string, optional): ID of document to tag. (Either content or document_id must be provided).
    - `max_tags` (integer, optional): Maximum number of tags (default: 5).
    - `model` (string, optional): Specific LLM to use. Default: "auto".
  - **Returns:**
    - `success` (boolean): Whether tag generation succeeded.
    - `tags` (array of strings): Array of generated tags.

- **answer_question**
  - Answer questions using RAG over your indexed content.
  - **Parameters:**
    - `question` (string): Question to answer.
    - `context_filter` (object, optional): Filters for context retrieval (e.g., {"document_id": "some_id"}).
    - `model` (string, optional): Specific LLM to use. Default: "auto".
  - **Returns:**
    - `success` (boolean): Whether question answering succeeded.
    - `answer` (string): Generated answer.
    - `sources` (array of objects): Source document chunks used for context, each with document_id, chunk_id, and content.
    - `confidence` (string, optional): Confidence level in the answer (LLM-dependent, might not always be present).
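
The remaining tools follow the same call pattern, and a single session can chain them into a small pipeline. The endpoint, query, and document ID below are illustrative:

```python
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "http://localhost:7860/gradio_api/mcp/sse"  # assumed endpoint

async def pipeline():
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Search first, then summarize, tag, and question a document.
            hits = await session.call_tool(
                "semantic_search",
                arguments={"query": "revenue growth", "top_k": 3},
            )
            summary = await session.call_tool(
                "summarize_content",
                arguments={"document_id": "doc-123", "style": "bullet_points"},
            )
            tags = await session.call_tool(
                "generate_tags",
                arguments={"document_id": "doc-123", "max_tags": 5},
            )
            answer = await session.call_tool(
                "answer_question",
                arguments={
                    "question": "How did revenue change year over year?",
                    "context_filter": {"document_id": "doc-123"},
                    "model": "auto",
                },
            )
            return hits, summary, tags, answer

print(asyncio.run(pipeline()))
```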

πŸ“Š Performance
Embedding Generation: ~100-500ms per document chunk
Search: <50ms for most queries
Summarization: 1-5s depending on content length
Memory Usage: ~200-500MB base + ~1MB per 1000 document chunks
Supported File Types: PDF, TXT, DOCX, PNG, JPG, JPEG