merge_pages drops text lines — multi-page DetectDocumentText aggregation loses content #2

Description

@cschanhniem

Bug

kernel.merge_pages only merges key_values, not lines:

def merge_pages(pages: list[Page]) -> Page:
    merged_kvs = [kv for p in pages for kv in p.key_values]
    return Page(width=pages[0].width, height=pages[0].height, key_values=merged_kvs)

If any caller (Pro or future OSS code) passes a multi-page text extraction result through merge_pages, the lines from every page are silently dropped. Only key_values survive, so a response built from the merged Page would contain zero text blocks for plain OCR use cases.
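A minimal reproduction sketch follows. The Page class here is a hypothetical stand-in (the real kernel class is not shown in this issue); it assumes Page is a dataclass whose lines and key_values default to empty lists, and the sample strings and dimensions are made up for illustration.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the kernel's Page; the real class may differ.
@dataclass
class Page:
    width: int
    height: int
    lines: list[str] = field(default_factory=list)
    key_values: list[tuple[str, str]] = field(default_factory=list)


def merge_pages(pages: list[Page]) -> Page:
    # Current behaviour as quoted in this issue: only key_values are carried over.
    merged_kvs = [kv for p in pages for kv in p.key_values]
    return Page(width=pages[0].width, height=pages[0].height, key_values=merged_kvs)


pages = [
    Page(width=612, height=792, lines=["Invoice 001"], key_values=[("Total", "42.00")]),
    Page(width=612, height=792, lines=["Thank you"]),
]
merged = merge_pages(pages)
print(len(merged.lines))  # 0: both text lines are gone
print(merged.key_values)  # [('Total', '42.00')]
```

No exception or warning is raised, which is why the loss is easy to miss.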

Expected

merge_pages should flatten both lines and key_values (and any other list fields on Page) from all input pages.

Suggested fix

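def merge_pages(pages: list[Page]) -> Page:
    # Nothing to merge for a single page; return it unchanged.
    if len(pages) == 1:
        return pages[0]
    # Dimensions come from the first page; list fields are flattened in page order.
    return Page(
        width=pages[0].width,
        height=pages[0].height,
        lines=[line for p in pages for line in p.lines],
        key_values=[kv for p in pages for kv in p.key_values],
    )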

If Page gains more list fields in future, they should be merged here too. Alternatively, the function could iterate over Page's dataclass fields automatically.
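One way the field-iterating variant might look, as a sketch only. It assumes Page is a dataclass and that every list-valued field should be concatenated in page order; the Page definition below is a hypothetical stand-in, and the empty-input guard is an addition not present in the suggested fix.

```python
from dataclasses import dataclass, field, fields, replace


# Hypothetical stand-in for the kernel's Page; the real class may differ.
@dataclass
class Page:
    width: int
    height: int
    lines: list[str] = field(default_factory=list)
    key_values: list[tuple[str, str]] = field(default_factory=list)


def merge_pages(pages: list[Page]) -> Page:
    # Guard added for this sketch; the original code would raise IndexError here.
    if not pages:
        raise ValueError("merge_pages needs at least one page")
    first = pages[0]
    # Treat every field whose value on the first page is a list as mergeable.
    merged_lists = {
        f.name: [item for p in pages for item in getattr(p, f.name)]
        for f in fields(first)
        if isinstance(getattr(first, f.name), list)
    }
    # Non-list fields (width, height) are copied from the first page.
    return replace(first, **merged_lists)
```

Unlike the suggested fix, this version always returns a new Page, even for a single input page, so callers never share list objects with the inputs. It also assumes a list field is never None on any page.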

Impact

Currently merge_pages is called from the Pro plugin path. The OSS endpoints pass the full pages list to the response builders directly, so OSS core is not currently affected. But a future aggregation endpoint or external caller using the kernel API would silently lose text.
