sc22mc/DocFusion

MingxuChai committed on Jun 1

Commit 30a3dde (verified) · 1 Parent(s): 71a94ba

Update README.md

Files changed (1)

README.md CHANGED

@@ -4,7 +4,7 @@ pipeline_tag: image-text-to-text
 ---
 ### DocFusion: A Unified Framework for Document Parsing Tasks
-Document parsing is essential for analyzing complex document structures and extracting fine-grained information, supporting numerous downstream applications. However, existing methods often require integrating multiple independent models to handle various parsing tasks, leading to high complexity and maintenance overhead. To address this, we propose DocFusion, a lightweight generative model with only 0.28B parameters. It unifies task representations and achieves collaborative training through an improved objective function. Experiments reveal and leverage the mutually beneficial interaction among recognition tasks, and integrating recognition data significantly enhances detection performance. The final results demonstrate that DocFusion achieves state-of-the-art (SOTA) performance across four key tasks.
+Document parsing involves layout element detection and recognition, essential for extracting information. However, existing methods often employ multiple models for these tasks, leading to increased system complexity and maintenance overhead. While some models attempt to unify detection and recognition, they often fail to address the intrinsic differences in data representations, thereby limiting performance in document processing. Our research reveals that recognition relies on discrete tokens, whereas detection relies on continuous coordinates, leading to challenges in gradient updates and optimization. To bridge this gap, we propose the Gaussian-Kernel CrossEntropy Loss (GK-CEL), enabling generative frameworks to handle both tasks simultaneously. Building upon GK-CEL, we propose DocFusion, a unified document parsing model with only 0.28B parameters. Additionally, we construct the DocLatex-1.6M dataset to provide high-quality training support. Experimental results show that DocFusion, equipped with GK-CEL, performs competitively across four core document parsing tasks, validating the effectiveness of our unified approach.
 Resources and Technical Documentation:
 + [Technical Report](https://arxiv.org/abs/2412.12505)
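
The updated abstract contrasts discrete recognition tokens with continuous detection coordinates. The sketch below illustrates one way that gap could be bridged. It is not the GK-CEL definition from the technical report: it assumes coordinates are quantized into bins and that the one-hot target is replaced by a Gaussian-weighted soft target centered on the true bin. All function names and the `sigma` parameter are hypothetical.

```python
import math

def gaussian_soft_targets(true_bin, num_bins, sigma=2.0):
    """Assumed form: normalized Gaussian weights over bins, peaked at true_bin."""
    weights = [math.exp(-((k - true_bin) ** 2) / (2.0 * sigma ** 2))
               for k in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def gaussian_kernel_cross_entropy(logits, true_bin, sigma=2.0):
    """Cross-entropy against the Gaussian soft target (illustrative only)."""
    targets = gaussian_soft_targets(true_bin, len(logits), sigma)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

def hard_cross_entropy(logits, true_bin):
    """Standard one-hot cross-entropy, for comparison."""
    return -log_softmax(logits)[true_bin]

# Two predictions for a coordinate whose true bin is 10:
# one peaks at the neighboring bin 11, the other at the distant bin 40.
num_bins, true_bin = 50, 10
near = [0.0] * num_bins
near[11] = 5.0
far = [0.0] * num_bins
far[40] = 5.0

hard_near = hard_cross_entropy(near, true_bin)
hard_far = hard_cross_entropy(far, true_bin)
soft_near = gaussian_kernel_cross_entropy(near, true_bin)
soft_far = gaussian_kernel_cross_entropy(far, true_bin)
```

Under these assumptions, the one-hot loss scores the near miss and the far miss identically, while the Gaussian-weighted loss penalizes the far miss more, which gives coordinate tokens a distance-aware training signal.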