M3DocDep

Multi-modal, Multi-page, Multi-document Dependency Chunking
with Large Vision-Language Models

Joongmin Shin1, Jeongbae Park1, Jaehyung Seo2†, Heuiseok Lim1,3†
1Human-inspired AI Research, Korea University   2CSE, Konkuk University   3CSE, Korea University
† Corresponding authors
CVPR 2026
M3DocDep pipeline overview

Figure 1. The M3DocDep pipeline. (a) SharedDet produces global document blocks via document parsing + OCR in a unified coordinate frame. (b) A frozen LVLM yields page multi-modal tokens; the SoftROI embedder pools tokens inside each block with boundary-aware weighting. (c) A biaffine head scores parent candidates; an MST decoder returns a globally valid dependency tree. (d) Tree-guided chunking traverses section subtrees and binds figures/tables to their captions.

Abstract

In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) is highly dependent on chunking quality. Existing text-centric chunkers overlook the visual and structural cues present in real-world documents, leading to redundant or ambiguous chunks that impair retrieval and answer accuracy.

We propose M3DocDep, which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy.

M3DocDep achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5–39.6%, retrieval nDCG by +1.1–15.3%, and QA ANLS by +4.5–15.3%. These results demonstrate that modeling document-level dependencies with multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.

Contributions

1. LVLM-based dependency scoring framework. Multi-modal block embeddings from a frozen LVLM, combined with parent–child edge scoring, recover a global document tree, including long-range, cross-page parent–child relations.
2. Tree-guided structure-aware chunking. Chunks are assembled along recovered section subtrees, preserving figure/table–caption bindings and annotating each chunk with section-path metadata. This yields retrieval units compatible with corpus-level RAG.
3. Validation under a shared evaluation protocol. With the same SharedDet blocks, the same chunk budget, and the same retrievers and readers, and with robustness checks under DP, OCR, and embedding swaps, M3DocDep consistently improves hierarchy recovery, retrieval quality, and QA accuracy.

Headline results

Relative improvement over the strongest chunking baseline (MultiDocFusion).

+10.6% avg retrieval nDCG gain (range +1.1 to +15.3%, 4 corpora)
+9.8% avg QA ANLS gain (range +4.5 to +15.3%, 4 corpora)
~5× STEDS over plain LVLM (on document hierarchical parsing)

Method

M3DocDep treats chunking as a dependency-recovery problem. Before chunking, we reconstruct a global document dependency tree using a vision-language model, then assemble chunks from coherent section subtrees of that tree. The pipeline has four stages:

(a) SharedDet (DP + OCR)

Run document parsing and OCR once per document to obtain a stable set of layout blocks — titles, headers, text, tables, figures, captions — in a unified coordinate frame. These Global Document Blocks are shared across all downstream models, ensuring fair component-wise comparison.
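As a rough illustration of what a "unified coordinate frame" could look like, the sketch below maps per-page pixel boxes into one document-level frame by stacking pages vertically and normalizing. The stacking convention, the `Block` fields, and the function name `to_global_frame` are assumptions made for this example; the source does not specify how SharedDet defines its frame.

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    page: int      # 0-based page index
    label: str     # e.g. "title", "text", "figure", "figure-caption"
    box: tuple     # (x0, y0, x1, y1); page pixels on input
    text: str = ""

def to_global_frame(blocks, page_sizes):
    """Map page-local boxes into one normalized document-level frame.

    Assumption: pages are stacked top to bottom, so a block's global y
    is its page-local y plus the total height of all earlier pages.
    page_sizes is a list of (width, height), one entry per page.
    """
    offsets, total_height = [], 0.0
    for _, height in page_sizes:
        offsets.append(total_height)
        total_height += height
    max_width = max(width for width, _ in page_sizes)

    out = []
    for b in blocks:
        x0, y0, x1, y1 = b.box
        off = offsets[b.page]
        global_box = (x0 / max_width, (y0 + off) / total_height,
                      x1 / max_width, (y1 + off) / total_height)
        out.append(Block(b.block_id, b.page, b.label, global_box, b.text))
    return out

# Toy input: two 100x200 pages, one caption block on the second page.
pages = [(100, 200), (100, 200)]
blocks = [Block(20, 1, "figure-caption", (10, 20, 50, 40), "Fig. 1 ...")]
global_blocks = to_global_frame(blocks, pages)
```

With this convention, blocks from different pages become directly comparable by position, which is one way cross-page parent candidates could be ordered.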

(b) LVLM-based multi-modal block embedding

Each page passes through a frozen vision-language model (e.g., Qwen2.5-VL, LLaVA-OneVision-1.5, InternVL-3.5) to produce Page Multi-modal Tokens. A SoftROI Embedder then pools tokens inside each block with a boundary-aware weighting that is robust to box jitter and OCR noise, yielding a multi-modal embedding per block.
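One plausible reading of "boundary-aware weighting" is a soft box membership: tokens well inside a block get weight near 1, tokens just outside get a small but non-zero weight, so a slightly misplaced box changes the pooled vector only a little. The sketch below uses a product of sigmoids for that purpose. The exact weighting used by the SoftROI Embedder is not given in the source, so the formula, the temperature `tau`, and the function name are assumptions.

```python
import math

def soft_roi_pool(tokens, centers, box, tau=0.05):
    """Weighted mean of page tokens with soft box membership.

    tokens:  list of feature vectors (lists of floats), one per patch token
    centers: list of (x, y) patch centers in normalized page coordinates
    box:     (x0, y0, x1, y1) block box in the same coordinates
    tau:     hypothetical softness parameter; smaller means harder edges
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    x0, y0, x1, y1 = box
    weights = []
    for cx, cy in centers:
        # Each factor is ~1 on the inner side of an edge and decays smoothly outside it.
        wx = sigmoid((cx - x0) / tau) * sigmoid((x1 - cx) / tau)
        wy = sigmoid((cy - y0) / tau) * sigmoid((y1 - cy) / tau)
        weights.append(wx * wy)

    total = sum(weights)  # strictly positive because a sigmoid is never 0
    dim = len(tokens[0])
    return [sum(w * t[k] for w, t in zip(weights, tokens)) / total
            for k in range(dim)]

# Toy input: one token inside the box, one far outside.
tokens = [[1.0, 0.0], [0.0, 1.0]]
centers = [(0.5, 0.5), (0.9, 0.9)]
pooled = soft_roi_pool(tokens, centers, box=(0.4, 0.4, 0.6, 0.6))
```

Uniform pooling, the ablation variant mentioned in the results, would correspond to replacing the soft weights with a hard 0/1 inside-the-box test.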

(c) Global Document Dependency Parsing

For each block, a biaffine scoring head scores parent candidates over the block embeddings, with a header-centric prior that restricts candidates to plausible attachments. An MST decoder then turns the edge scores into a globally valid dependency tree — single root, single parent, acyclic, including cross-page links. This replaces the long-form sequence generation used by SFT-based LVLMs.

(d) Structure-aware dependency chunking

We traverse the recovered tree by section subtree (DFS), force figures and tables to stay with their captions via tree-link binding, and annotate every chunk with its section path, page range, and constituent block IDs. The resulting chunks align retrieval units with true semantic and hierarchical structure.
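The traversal described above can be sketched as a depth-first walk that opens a new chunk at every header and collects all non-header descendants into it. This is a minimal sketch under several assumptions: the label names, the dict layout, and the function name `tree_chunks` are hypothetical, top-level blocks are assumed to be headers, and chunk-budget splitting is left out. Block IDs and section names follow the qualitative example; block 18 and its text are invented for illustration.

```python
HEADER_LABELS = {"title", "section-title"}  # assumed label names

def tree_chunks(blocks, parent):
    """blocks: dicts with id, label, page, text, in reading order.
    parent: dict mapping block id -> parent id (None for top-level blocks)."""
    by_id = {b["id"]: b for b in blocks}
    children = {}
    for b in blocks:
        children.setdefault(parent[b["id"]], []).append(b["id"])
    chunks = []

    def collect(node_id, chunk):
        # DFS: a caption is a child of its figure, so it lands right after it.
        chunk["block_ids"].append(node_id)
        chunk["pages"].append(by_id[node_id]["page"])
        for child in children.get(node_id, []):
            if by_id[child]["label"] in HEADER_LABELS:
                open_section(child, chunk["section_path"])
            else:
                collect(child, chunk)

    def open_section(header_id, path):
        chunk = {"section_path": path + [by_id[header_id]["text"]],
                 "block_ids": [], "pages": []}
        chunks.append(chunk)
        collect(header_id, chunk)

    for top in children.get(None, []):
        open_section(top, [])
    for chunk in chunks:
        pages = chunk.pop("pages")
        chunk["page_range"] = (min(pages), max(pages))
    return chunks

blocks = [
    {"id": 1, "label": "title", "page": 1, "text": "Title"},
    {"id": 17, "label": "section-title", "page": 2, "text": "1. Intro"},
    {"id": 18, "label": "text", "page": 2, "text": "body paragraph"},
    {"id": 19, "label": "figure", "page": 2, "text": ""},
    {"id": 20, "label": "figure-caption", "page": 2, "text": "Fig. 1 ..."},
]
parent = {1: None, 17: 1, 18: 17, 19: 17, 20: 19}
chunks = tree_chunks(blocks, parent)
```

Because the figure–caption link is an edge in the tree, the binding needs no special case in this sketch: the caption can only be emitted inside the chunk that contains its figure.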

Qualitative example

A real case from the paper: a five-page physics paper on photon trajectories in the equatorial plane of a Kerr black hole. Fig. 1 sits on page 2; its caption continues across a column boundary. We trace M3DocDep end-to-end — from raw multi-page input, to the recovered dependency tree, to the structure-aware chunk it emits.

Input. A 5-page physics paper. Fig. 1 appears on page 2; the surrounding section content (title, intro, body paragraphs, caption) is scattered across pages 1–3. Naive chunkers slice this stack by tokens or sentence boundaries — losing the figure–caption binding the first time a page break gets in the way.

The 5-page input document with page 2 highlighted

Dependency tree. M3DocDep treats every layout block (title, section-title, text, figure, figure-caption) as a node and learns parent–child edges across the whole document — including cross-page links. The orange Key Path highlights the dependency chain this chunk follows: ROOT → 1:title → 17:section-title → 19:figure → 20:figure-caption.

Recovered dependency tree with Key Path highlighted

Output. The structure-aware chunk M3DocDep emits. It records the section path (# Title › ## 1. Intro), the page range spanned (2), the figure block with its bound caption, and the explicit dependency edge (19:figure → 20:caption). This is the retrieval unit used at RAG time — the reader sees the figure with its caption and its enclosing section all in one shot.

The final structure-aware chunk card emitted by M3DocDep

Figures reproduced from the paper's supplementary qualitative example.

Chunk comparison per baseline

Reproduced from the supplementary's table of chunking comparisons: for the same Kerr black hole input, each method produces its own 3 chunks. Shown here are the length-chunking baseline and M3DocDep, to illustrate how structure-aware chunks differ.


Length chunking

Fixed-size token windows.

M3DocDep (ours)

Tree-guided structure-aware chunking

Global dependency tree → section subtree chunks with section_path and page metadata.

Chunk texts are exactly as reported in the supplementary's table of chunking examples; truncations are marked with …

Results

Evaluated on four corpora — DUDE, MP-DocVQA, CUAD, MOAMOB — covering financial reports, contracts, scanned forms, and complex layouts.

Document Hierarchical Parsing (DHP) — F1 / STEDS (%)

Method                   HRDS          HRDH          DocHieNet
Qwen2.5-VL (LVLM only)   28.4 / 14.5   27.6 / 20.0   18.2 / 9.6
DSPS                     65.3 / 59.6   54.1 / 38.4   35.6 / 23.8
DSHP-LLM                 44.9 / 29.5   61.3 / 51.3   64.3 / 53.5
M3DocDep (ours)          82.9 / 76.5   77.8 / 71.7   76.0 / 70.8

~5× STEDS over a plain LVLM. Structural constraints (header-centric priors, MST decoding, cross-page edges) matter.
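The "~5×" figure appears to be a ratio of average STEDS across the three DHP datasets. The short check below recomputes it from the table values; reading the claim as a ratio of means is an assumption, and the per-dataset ratios vary considerably around it.

```python
# STEDS (%) from the DHP table: (plain Qwen2.5-VL, M3DocDep)
steds = {
    "HRDS": (14.5, 76.5),
    "HRDH": (20.0, 71.7),
    "DocHieNet": (9.6, 70.8),
}

lvlm_mean = sum(lvlm for lvlm, _ in steds.values()) / len(steds)
ours_mean = sum(ours for _, ours in steds.values()) / len(steds)
ratio_of_means = ours_mean / lvlm_mean

# Per-dataset ratios, to show the spread behind the headline number.
per_dataset = {name: ours / lvlm for name, (lvlm, ours) in steds.items()}

print(f"ratio of mean STEDS: {ratio_of_means:.2f}x")
for name, ratio in per_dataset.items():
    print(f"{name}: {ratio:.2f}x")
```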

Downstream QA — ANLS (%)

Dataset      Ours    Δ vs. MultiDocFusion
DUDE         21.4    +2.84
MP-DocVQA    18.2    +2.02
CUAD         29.3    +1.87
MOAMOB       27.1    +1.18

Averaged over LLaVA-OneVision-1.5, InternVL-3.5, Qwen2.5-VL readers.
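The Δ column and the relative gains quoted in the headline results can be reconciled with a small calculation. It assumes that Δ is an absolute difference in ANLS points and that the MultiDocFusion score is therefore "Ours" minus Δ; neither is stated explicitly in the table. Since the inputs are rounded, the recomputed percentages only approximately match the quoted ones.

```python
# (ours ANLS %, delta vs. MultiDocFusion) from the table above
anls = {
    "DUDE": (21.4, 2.84),
    "MP-DocVQA": (18.2, 2.02),
    "CUAD": (29.3, 1.87),
    "MOAMOB": (27.1, 1.18),
}

def relative_gain(ours, delta):
    """Relative improvement in percent, assuming baseline = ours - delta."""
    baseline = ours - delta
    return 100.0 * delta / baseline

gains = {name: relative_gain(ours, delta) for name, (ours, delta) in anls.items()}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
print(f"average: +{average_gain:.1f}%")
```

Under these assumptions the per-dataset gains span roughly +4.5% to +15.3% with an average near +9.8%, in line with the headline QA ANLS numbers.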

Ablation — STEDS drop

No cross-page edges −9.26
MST → local argmax −6.70
No header-centric prior −2.55
SoftROI → uniform pooling −1.70

Avg STEDS drop across 3 datasets (HRDS · HRDH · DocHieNet).

Robust to component swaps

M3DocDep ranks first in nDCG under every individual pipeline swap: 3 DP backbones (DETR, DiT, VGT), 3 OCR engines (Tesseract, EasyOCR, TrOCR), and 4 embedding models (BGE, E5, BM25, MM-Embed). Backbone-agnostic by design.

BibTeX

@InProceedings{Shin_2026_CVPR,
    author    = {Shin, Joongmin and Park, Jeongbae and Seo, Jaehyung and Lim, Heuiseok},
    title     = {M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16603-16613}
}