Space: hellorahulk/docling_free


hellorahulk committed on Jan 23

Commit 0df5e58 (1 parent: 15fdcff)

Add Hugging Face Space configuration


Files changed (1)

README.md +46 -70


@@ -1,81 +1,57 @@
-# Dockling Parser
-A powerful multiformat document parsing module built on top of Docling. This module provides a unified interface for parsing various document formats including PDF, DOCX, TXT, HTML, and Markdown.
-## Features
-- Unified interface for multiple document formats
-- Rich metadata extraction
-- Structured content parsing
-- Format detection using MIME types
-- Error handling and validation
-- Type-safe using Pydantic models
-- Web interface using Gradio
-## Installation
-```bash
-pip install -r requirements.txt
-```
-## Usage
-### Python API
-```python
-from dockling_parser import DocumentParser
-# Initialize parser
-parser = DocumentParser()
-# Parse a document
-result = parser.parse("path/to/document.pdf")
-# Access parsed content
-print(result.content)  # Get main text content
-print(result.metadata)  # Get document metadata
-print(result.structured_content)  # Get structured content (sections, paragraphs, etc.)
-# Check format support
-is_supported = parser.supports_format("application/pdf")
-```
-### Web Interface
-The package includes a Gradio-based web interface for easy document parsing:
-```bash
-python app.py
-```
-This will launch a web interface with the following features:
-- Drag-and-drop document upload
-- Support for multiple document formats
-- Automatic format detection
-- Structured output display:
-  - Document content
-  - Metadata table
   - Section breakdown
   - Named entity recognition
   - Confidence scoring
-## Supported Formats
-- PDF (application/pdf)
-- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
-- Plain Text (text/plain)
-- HTML (text/html)
-- Markdown (text/markdown)
-## Error Handling
-The module provides specific exceptions for different error cases:
-- `UnsupportedFormatError`: When the document format is not supported
-- `ParseError`: When document parsing fails
-- `ValidationError`: When document validation fails
-- `EncodingError`: When document encoding issues occur
-## License
 MIT License

+---
+title: Smart Document Parser
+emoji: 📄
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 4.0.0
+app_file: app.py
+pinned: false
+---
+# 📄 Smart Document Parser
+A powerful document parsing application that automatically extracts structured information from various document formats.
+## 🚀 Features
+- **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
+- **Rich Information Extraction**:
+  - Document content with preserved formatting
+  - Comprehensive metadata
   - Section breakdown
   - Named entity recognition
+- **Smart Processing**:
+  - Automatic format detection
   - Confidence scoring
+  - Error handling
+## 🎯 How to Use
+1. **Upload Document**: Click the upload button or drag & drop your document
+2. **Process**: Click "Process Document"
+3. **View Results**: Explore the extracted information in different tabs:
+   - 📝 Content: Main document text
+   - 📊 Metadata: Document properties
+   - 📑 Sections: Document structure
+   - 🏷️ Entities: Named entities
+## 📋 Supported Formats
+- PDF Documents (*.pdf)
+- Word Documents (*.docx)
+- Text Files (*.txt)
+- HTML Files (*.html)
+- Markdown Files (*.md)
+## 🛠️ Technical Details
+Built with:
+- Docling: Advanced document processing
+- Gradio: Interactive web interface
+- Pydantic: Type-safe data handling
+- Hugging Face Spaces: Cloud deployment
+## 📝 License
 MIT License
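
The new README drops the MIME type list but still advertises automatic format detection. As a rough illustration only, the extension-to-MIME mapping implied by the old and new "Supported Formats" sections could be sketched as below. The names `SUPPORTED_MIME_TYPES` and `detect_mime_type` are hypothetical and are not taken from the repository; the actual app may detect formats differently (for example, by inspecting file contents).

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table: extensions from the new README paired with the
# MIME types listed in the removed "Supported Formats" section.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".txt": "text/plain",
    ".html": "text/html",
    ".md": "text/markdown",
}


def detect_mime_type(path: str) -> Optional[str]:
    """Return the MIME type for a supported extension, or None otherwise.

    This is an extension-based guess only; it does not read the file.
    """
    suffix = Path(path).suffix.lower()  # normalise ".PDF" to ".pdf"
    return SUPPORTED_MIME_TYPES.get(suffix)
```

A caller could treat a `None` result as an unsupported format and report an error before attempting to parse.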