hellorahulk committed on
Commit 0df5e58 · 1 Parent(s): 15fdcff

Add Hugging Face Space configuration

Files changed (1)
  1. README.md +46 -70
README.md CHANGED
@@ -1,81 +1,57 @@
- # Dockling Parser
-
- A powerful multiformat document parsing module built on top of Docling. This module provides a unified interface for parsing various document formats including PDF, DOCX, TXT, HTML, and Markdown.
-
- ## Features
-
- - Unified interface for multiple document formats
- - Rich metadata extraction
- - Structured content parsing
- - Format detection using MIME types
- - Error handling and validation
- - Type-safe using Pydantic models
- - Web interface using Gradio
-
- ## Installation
-
- ```bash
- pip install -r requirements.txt
- ```
-
- ## Usage
-
- ### Python API
-
- ```python
- from dockling_parser import DocumentParser
-
- # Initialize parser
- parser = DocumentParser()
-
- # Parse a document
- result = parser.parse("path/to/document.pdf")
-
- # Access parsed content
- print(result.content) # Get main text content
- print(result.metadata) # Get document metadata
- print(result.structured_content) # Get structured content (sections, paragraphs, etc.)
-
- # Check format support
- is_supported = parser.supports_format("application/pdf")
- ```
-
- ### Web Interface
-
- The package includes a Gradio-based web interface for easy document parsing:
-
- ```bash
- python app.py
- ```
-
- This will launch a web interface with the following features:
- - Drag-and-drop document upload
- - Support for multiple document formats
- - Automatic format detection
- - Structured output display:
-   - Document content
-   - Metadata table
    - Section breakdown
    - Named entity recognition
    - Confidence scoring

- ## Supported Formats

- - PDF (application/pdf)
- - DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
- - Plain Text (text/plain)
- - HTML (text/html)
- - Markdown (text/markdown)
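
Note (illustrative, not part of the commit): the removed section pairs each format with a MIME type, and the removed feature list mentions format detection using MIME types. A minimal sketch of such a check with Python's standard `mimetypes` module might look like the following. The names `SUPPORTED_MIME_TYPES` and `guess_supported` are hypothetical; the actual detection logic of `dockling_parser` is not shown in this diff.

```python
import mimetypes
from typing import Optional, Tuple

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# MIME types copied from the removed "Supported Formats" list.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    DOCX_MIME,
    "text/plain",
    "text/html",
    "text/markdown",
}

# Register these two explicitly, since the default table on some
# platforms or older Python versions may not include them.
mimetypes.add_type("text/markdown", ".md")
mimetypes.add_type(DOCX_MIME, ".docx")


def guess_supported(path: str) -> Tuple[Optional[str], bool]:
    """Guess the MIME type from the file extension and report whether it is listed."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime, mime in SUPPORTED_MIME_TYPES


if __name__ == "__main__":
    for name in ("report.pdf", "notes.md", "archive.zip"):
        print(name, guess_supported(name))
```

This guesses from the file extension only. A parser that inspects file contents would need a different approach.
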

- ## Error Handling

- The module provides specific exceptions for different error cases:

- - `UnsupportedFormatError`: When the document format is not supported
- - `ParseError`: When document parsing fails
- - `ValidationError`: When document validation fails
- - `EncodingError`: When document encoding issues occur
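
Note (illustrative, not part of the commit): the four exception names above come from the removed README, but their definitions are not shown in this diff. The sketch below defines stand-in classes with the same names so that it runs on its own. The shared base class `DocumentParserError` and the helper `safe_parse` are assumptions made for illustration only.

```python
from typing import Callable, Optional, Tuple


class DocumentParserError(Exception):
    """Hypothetical common base class (an assumption, not confirmed by the README)."""


class UnsupportedFormatError(DocumentParserError):
    """Stand-in: the document format is not supported."""


class ParseError(DocumentParserError):
    """Stand-in: document parsing failed."""


class ValidationError(DocumentParserError):
    """Stand-in: document validation failed."""


class EncodingError(DocumentParserError):
    """Stand-in: the document has encoding issues."""


def safe_parse(
    parse: Callable[[str], object], path: str
) -> Tuple[Optional[object], Optional[str]]:
    """Call parse(path) and turn the known exceptions into a short message."""
    try:
        return parse(path), None
    except UnsupportedFormatError as exc:
        return None, f"unsupported format: {exc}"
    except EncodingError as exc:
        return None, f"encoding problem: {exc}"
    except (ParseError, ValidationError) as exc:
        return None, f"could not parse: {exc}"


if __name__ == "__main__":
    def fake_parse(path: str) -> object:
        # Stand-in for a real parser call; always rejects the input.
        raise UnsupportedFormatError("image/png")

    print(safe_parse(fake_parse, "picture.png"))
```

Returning a `(result, message)` pair is one way to feed a web interface without letting exceptions escape. Callers that prefer exceptions can catch the individual classes directly.
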
 

- ## License

  MIT License
 
+ ---
+ title: Smart Document Parser
+ emoji: 📄
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 4.0.0
+ app_file: app.py
+ pinned: false
+ ---
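
Note (illustrative, not part of the commit): the block added above is the front matter that configures the Space. As a rough sketch, its flat `key: value` pairs can be read back with a few lines of Python, for example to confirm that `app_file` names the intended script. The function `read_front_matter` is hypothetical, handles only simple one-line pairs, and is not a YAML parser.

```python
README_HEAD = """\
---
title: Smart Document Parser
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---

# Smart Document Parser
"""


def read_front_matter(text: str) -> dict:
    """Return the key/value pairs between the first two '---' lines as strings."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # opening delimiter without a closing one


if __name__ == "__main__":
    config = read_front_matter(README_HEAD)
    print(config["sdk"], config["sdk_version"], config["app_file"])
```

All values come back as strings, so `pinned` is `"false"` and not a boolean. A real YAML library would be the better choice for anything beyond a quick check.
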
+
+ # 📄 Smart Document Parser
+
+ A powerful document parsing application that automatically extracts structured information from various document formats.
+
+ ## 🚀 Features
+
+ - **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
+ - **Rich Information Extraction**:
+   - Document content with preserved formatting
+   - Comprehensive metadata
    - Section breakdown
    - Named entity recognition
+ - **Smart Processing**:
+   - Automatic format detection
    - Confidence scoring
+   - Error handling
+
+ ## 🎯 How to Use

+ 1. **Upload Document**: Click the upload button or drag & drop your document
+ 2. **Process**: Click "Process Document"
+ 3. **View Results**: Explore the extracted information in different tabs:
+    - 📝 Content: Main document text
+    - 📊 Metadata: Document properties
+    - 📑 Sections: Document structure
+    - 🏷️ Entities: Named entities

+ ## 📋 Supported Formats

+ - PDF Documents (*.pdf)
+ - Word Documents (*.docx)
+ - Text Files (*.txt)
+ - HTML Files (*.html)
+ - Markdown Files (*.md)

+ ## 🛠️ Technical Details

+ Built with:
+ - Docling: Advanced document processing
+ - Gradio: Interactive web interface
+ - Pydantic: Type-safe data handling
+ - Hugging Face Spaces: Cloud deployment

+ ## 📝 License

  MIT License