Link exchange #7

@MaxTEX310

Your team's work is very good. Could we exchange links in our READMEs? I have developed the UniOne dataset, which aims to let upstream and downstream document parsing tasks share a single dataset. Looking forward to your reply.

Document parsing currently suffers from "task silos" caused by fragmented datasets. As deep learning becomes widely used in document understanding, a unified, shared dataset that lets upstream and downstream tasks be developed together is increasingly needed. We have built UniOne, a document dataset that supports parsing across both upstream and downstream tasks. It systematically integrates layout analysis, text line detection and recognition, and table recognition into one cross-task annotated dataset. The dataset: (1) for layout analysis, includes 236,790 paragraph-level annotations across 14,481 pages, covering 11 semantic categories; (2) for text line detection, builds on the layout analysis data and adds fine-grained annotations for 340,890 lines in 198,901 text paragraphs; (3) for complex scenarios, adds 8,000 challenging handwritten mathematical expressions, 18,717 printed mathematical formulas, 26,849 formula texts with unified recognition annotations, and 1,169 tables extracted from papers, to fully support document content parsing. To our knowledge, this is the first dataset to enable cross-task joint modeling from macro-level layout down to micro-level elements, overcoming the limitations of traditional single-task datasets and providing essential infrastructure for the next generation of intelligent document parsing systems.

https://github.com/MaxTEX310/UniOne
