Commit · 0df5e58
1 Parent(s): 15fdcff
Add Hugging Face Space configuration
README.md CHANGED
@@ -1,81 +1,57 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-### Python API
-
-```python
-from dockling_parser import DocumentParser
-
-# Initialize parser
-parser = DocumentParser()
-
-# Parse a document
-result = parser.parse("path/to/document.pdf")
-
-# Access parsed content
-print(result.content) # Get main text content
-print(result.metadata) # Get document metadata
-print(result.structured_content) # Get structured content (sections, paragraphs, etc.)
-
-# Check format support
-is_supported = parser.supports_format("application/pdf")
-```
-
-### Web Interface
-
-The package includes a Gradio-based web interface for easy document parsing:
-
-```bash
-python app.py
-```
-
-This will launch a web interface with the following features:
-- Drag-and-drop document upload
-- Support for multiple document formats
-- Automatic format detection
-- Structured output display:
-  - Document content
-  - Metadata table
+---
+title: Smart Document Parser
+emoji: π
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 4.0.0
+app_file: app.py
+pinned: false
+---
+
+# π Smart Document Parser
+
+A powerful document parsing application that automatically extracts structured information from various document formats.
+
+## π Features
+
+- **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
+- **Rich Information Extraction**:
+  - Document content with preserved formatting
+  - Comprehensive metadata
   - Section breakdown
   - Named entity recognition
+- **Smart Processing**:
+  - Automatic format detection
   - Confidence scoring
+  - Error handling
+
+## 🎯 How to Use
 
-
+1. **Upload Document**: Click the upload button or drag & drop your document
+2. **Process**: Click "Process Document"
+3. **View Results**: Explore the extracted information in different tabs:
+   - π Content: Main document text
+   - π Metadata: Document properties
+   - π Sections: Document structure
+   - 🏷️ Entities: Named entities
 
-
-- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
-- Plain Text (text/plain)
-- HTML (text/html)
-- Markdown (text/markdown)
+## π Supported Formats
 
-
+- PDF Documents (*.pdf)
+- Word Documents (*.docx)
+- Text Files (*.txt)
+- HTML Files (*.html)
+- Markdown Files (*.md)
 
-
+## 🛠️ Technical Details
 
-
--
--
--
+Built with:
+- Docling: Advanced document processing
+- Gradio: Interactive web interface
+- Pydantic: Type-safe data handling
+- Hugging Face Spaces: Cloud deployment
 
-## License
+## π License
 
 MIT License
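
The README text in this diff names five file extensions (added lines) and the MIME types that went with them (removed lines, plus the `application/pdf` argument in the removed `supports_format` call). As a minimal sketch of how "automatic format detection" could map one to the other, assuming a plain extension lookup: `SUPPORTED_FORMATS` and `detect_format` are hypothetical names for illustration, not part of `dockling_parser` or of the Space's `app.py`, whose actual detection logic is not shown in this commit.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table. Extensions come from the "Supported Formats" list
# added in this commit; MIME types come from the removed README lines.
SUPPORTED_FORMATS = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".txt": "text/plain",
    ".html": "text/html",
    ".md": "text/markdown",
}


def detect_format(path: str) -> Optional[str]:
    """Return the MIME type for a supported file extension, or None if unsupported."""
    # Compare case-insensitively so "REPORT.PDF" and "report.pdf" behave the same.
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())
```

A caller could reject an upload early when `detect_format(filename)` returns `None`, before handing the file to the parser. Extension lookup is only a first guess; it does not inspect file contents.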