M3DocDep

Multi-modal, Multi-page, Multi-document Dependency Chunking
with Large Vision-Language Models

Joongmin Shin¹, Jeongbae Park¹, Jaehyung Seo²†, Heuiseok Lim¹,³†

¹Human-inspired AI Research, Korea University ²CSE, Konkuk University ³CSE, Korea University

†Corresponding authors

CVPR 2026

Figure 1. The M3DocDep pipeline. (a) SharedDet produces global document blocks via document parsing + OCR in a unified coordinate frame. (b) A frozen LVLM yields page multi-modal tokens; the SoftROI embedder pools tokens inside each block with boundary-aware weighting. (c) A biaffine head scores parent candidates; an MST decoder returns a globally valid dependency tree. (d) Tree-guided chunking traverses section subtrees and binds figures/tables to their captions.

Abstract

In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) is highly dependent on chunking quality. Existing text-centric chunkers overlook the visual and structural cues present in real-world documents, leading to redundant or ambiguous chunks that impair retrieval and answer accuracy.

We propose M3DocDep, which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy.

M3DocDep achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5–39.6%, retrieval nDCG by +1.1–15.3%, and QA ANLS by +4.5–15.3%. These results demonstrate that modeling document-level dependencies with multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.

Contributions

LVLM-based dependency scoring framework. Multi-modal block embeddings from a frozen LVLM, combined with parent–child edge scoring, recover a global document tree — including long-range, cross-page parent–child relations.

Tree-guided structure-aware chunking. Chunks are assembled along recovered section subtrees, preserving figure/table–caption bindings and annotating each chunk with section-path metadata — yielding retrieval units compatible with corpus-level RAG.

Validation under a shared evaluation protocol. All methods use the same SharedDet blocks, the same chunk budget, and the same retrievers and readers. Under this protocol, and under robustness checks that swap the document parsing (DP), OCR, and embedding components, M3DocDep consistently improves hierarchy recovery, retrieval quality, and QA accuracy.

Headline results

Relative improvement over the strongest chunking baseline (MultiDocFusion).

+10.6% avg retrieval nDCG gain (range +1.1 to +15.3%, 4 corpora)

+9.8% avg QA ANLS gain (range +4.5 to +15.3%, 4 corpora)

~5× STEDS over plain LVLM (on document hierarchical parsing)

Method

M3DocDep treats chunking as a dependency-recovery problem. Before chunking, we reconstruct a global document dependency tree using a vision-language model, then assemble chunks from coherent section subtrees of that tree. The pipeline has four stages:

(a) SharedDet (DP + OCR)

Run document parsing and OCR once per document to obtain a stable set of layout blocks — titles, headers, text, tables, figures, captions — in a unified coordinate frame. These Global Document Blocks are shared across all downstream models, ensuring fair component-wise comparison.
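As a rough illustration of what a unified coordinate frame could look like, the sketch below maps page-local boxes into one document-level frame by stacking pages vertically. The stacking convention, the `Block` fields, and the name `to_document_frame` are assumptions made for this example; the page does not specify how SharedDet normalizes coordinates.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    page: int      # 0-based page index
    label: str     # e.g. "title", "text", "figure", "caption"
    box: tuple     # (x0, y0, x1, y1)


def to_document_frame(blocks, page_sizes):
    """Map page-local pixel boxes into one document-level frame.

    Assumed convention: x is divided by the page width, y by the page
    height, and the page index is added to y, so page k spans y in [k, k+1).
    """
    out = []
    for b in blocks:
        width, height = page_sizes[b.page]
        x0, y0, x1, y1 = b.box
        out.append(Block(
            b.block_id, b.page, b.label,
            (x0 / width, b.page + y0 / height, x1 / width, b.page + y1 / height),
        ))
    return out


# Hypothetical input: two blocks on two pages of different heights.
blocks = [
    Block(0, 0, "title", (100, 50, 500, 100)),
    Block(1, 1, "figure", (100, 200, 500, 600)),
]
page_sizes = [(1000, 1000), (1000, 2000)]
doc_blocks = to_document_frame(blocks, page_sizes)
for b in doc_blocks:
    print(b.block_id, b.label, b.box)
```

With a shared frame of this kind, blocks from different pages can be compared and ordered directly, which is what makes cross-page parent candidates easy to enumerate later.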

(b) LVLM-based multi-modal block embedding

Each page passes through a frozen vision-language model (e.g., Qwen2.5-VL, LLaVA-OneVision-1.5, InternVL-3.5) to produce Page Multi-modal Tokens. A SoftROI Embedder then pools tokens inside each block with a boundary-aware weighting that is robust to box jitter and OCR noise, yielding a multi-modal embedding per block.
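To make the idea of boundary-aware pooling concrete, here is a minimal sketch that weights each page token by a smooth "inside the box" score and takes a weighted mean. The sigmoid-based weighting, the temperature `tau`, and the function names are stand-ins chosen for illustration; the page does not give the actual SoftROI formula.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_inside(coord, lo, hi, tau):
    """Smooth indicator of lo <= coord <= hi; tau sets how soft the edges are."""
    return _sigmoid((coord - lo) / tau) * _sigmoid((hi - coord) / tau)


def soft_roi_pool(tokens, centers, box, tau=0.05):
    """Weighted mean of token vectors, with weights that fade near box edges."""
    x0, y0, x1, y1 = box
    weights = [
        soft_inside(cx, x0, x1, tau) * soft_inside(cy, y0, y1, tau)
        for cx, cy in centers
    ]
    total = sum(weights)
    dim = len(tokens[0])
    return [
        sum(w * t[d] for w, t in zip(weights, tokens)) / total
        for d in range(dim)
    ]


# Hypothetical page with two tokens: one inside the block, one outside.
tokens = [[1.0, 0.0], [0.0, 1.0]]
centers = [(0.5, 0.5), (0.1, 0.5)]
box = (0.3, 0.3, 0.7, 0.7)

pooled = soft_roi_pool(tokens, centers, box)
jittered = soft_roi_pool(tokens, centers, (0.32, 0.32, 0.72, 0.72))
print(pooled, jittered)
```

Because the weights change smoothly with the box coordinates, a small shift of the box changes the pooled vector only slightly, which is the kind of robustness to box jitter the text describes.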

(c) Global Document Dependency Parsing

For each block, a biaffine scoring head scores parent candidates over the block embeddings, with a header-centric prior that restricts candidates to plausible attachments. An MST decoder then turns the edge scores into a globally valid dependency tree — single root, single parent, acyclic, including cross-page links. This replaces the long-form sequence generation used by SFT-based LVLMs.
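The difference between picking each block's best parent independently and decoding a globally valid tree can be seen on a toy example. The sketch below is an assumption-laden illustration: `biaffine_score` shows only the bilinear core plus a bias, the edge scores are invented, and `decode_tree` uses exhaustive search in place of a real maximum-spanning-tree algorithm (such as Chu-Liu/Edmonds), which is only practical for a handful of blocks. Node 0 plays the role of ROOT.

```python
from itertools import product


def biaffine_score(child, parent, U, bias=0.0):
    """child^T U parent + bias: the bilinear core of a biaffine scorer."""
    return sum(
        c * U[i][j] * p
        for i, c in enumerate(child)
        for j, p in enumerate(parent)
    ) + bias


def is_tree(parents):
    """True if every node reaches ROOT (0) via parent links without a cycle."""
    for node in parents:
        seen = set()
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node]
    return True


def decode_tree(scores):
    """Exhaustive search for the highest-scoring valid tree (tiny inputs only).

    scores[child][parent] holds an edge score; parents missing from the
    inner dict are treated as disallowed candidates.
    """
    children = sorted(scores)
    best, best_total = None, float("-inf")
    for choice in product(*(sorted(scores[c]) for c in children)):
        parents = dict(zip(children, choice))
        if not is_tree(parents):
            continue
        total = sum(scores[c][p] for c, p in parents.items())
        if total > best_total:
            best, best_total = parents, total
    return best, best_total


# Invented edge scores for three blocks; blocks 1 and 2 each prefer the other.
scores = {
    1: {0: 1.0, 2: 5.0, 3: 0.0},
    2: {0: 2.5, 1: 6.0, 3: 0.0},
    3: {0: 0.5, 1: 3.0, 2: 1.0},
}
greedy = {c: max(cands, key=cands.get) for c, cands in scores.items()}
tree, total = decode_tree(scores)
print("local argmax:", greedy, "valid tree:", is_tree(greedy))
print("decoded tree:", tree, "score:", total)
```

In this toy case the local argmax links blocks 1 and 2 to each other and so forms a cycle, while the global search returns a tree in which every block has exactly one parent and a path to ROOT. Restricting the inner dictionaries to plausible parents is one simple way a candidate prior could be applied.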

(d) Structure-aware dependency chunking

We traverse the recovered tree by section subtree (DFS), force figures and tables to stay with their captions via tree-link binding, and annotate every chunk with its section path, page range, and constituent block IDs. The resulting chunks align retrieval units with true semantic and hierarchical structure.
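The traversal described above can be sketched as follows, assuming the tree is given as a child-to-parent map and each block carries a type, page, and text. The toy document, the block IDs, the field names, and the rule "one chunk per section-title subtree" are all assumptions for illustration; the sketch also ignores the chunk budget.

```python
def section_path(block_id, parent, blocks):
    """Texts of the enclosing title / section-title blocks, outermost first."""
    path = []
    node = parent.get(block_id)
    while node is not None:
        if blocks[node]["type"] in ("title", "section-title"):
            path.append(blocks[node]["text"])
        node = parent.get(node)
    return path[::-1]


def chunk_sections(blocks, parent):
    """Emit one chunk per section subtree, collected depth-first."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def collect(node):
        # Depth-first walk; a nested section-title starts its own chunk.
        ids = [node]
        for c in sorted(children.get(node, [])):
            if blocks[c]["type"] != "section-title":
                ids.extend(collect(c))
        return ids

    chunks = []
    for bid in sorted(blocks):
        if blocks[bid]["type"] != "section-title":
            continue
        ids = collect(bid)
        pages = [blocks[i]["page"] for i in ids]
        chunks.append({
            "section_path": section_path(bid, parent, blocks) + [blocks[bid]["text"]],
            "page_range": (min(pages), max(pages)),
            "block_ids": ids,
            "text": "\n".join(blocks[i]["text"] for i in ids),
        })
    return chunks


# Hypothetical document; the caption (5) is a child of the figure (4).
blocks = {
    1: {"type": "title", "page": 1, "text": "Paper title"},
    2: {"type": "section-title", "page": 1, "text": "1. Intro"},
    3: {"type": "text", "page": 1, "text": "Intro paragraph."},
    4: {"type": "figure", "page": 2, "text": "[Figure 1]"},
    5: {"type": "figure-caption", "page": 2, "text": "Fig. 1. Caption text."},
    6: {"type": "section-title", "page": 2, "text": "2. Method"},
    7: {"type": "text", "page": 3, "text": "Method paragraph."},
}
parent = {1: None, 2: 1, 3: 2, 4: 2, 5: 4, 6: 1, 7: 6}

chunks = chunk_sections(blocks, parent)
for c in chunks:
    print(c["section_path"], c["page_range"], c["block_ids"])
```

Since the caption hangs off the figure in the tree, the depth-first walk places the two next to each other in the same chunk even when a page break separates the section title from the figure.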

Qualitative example

A real case from the paper: a five-page physics paper on photon trajectories in the equatorial plane of a Kerr black hole. Fig. 1 sits on page 2; its caption continues across a column boundary. We trace M3DocDep end-to-end — from raw multi-page input, to the recovered dependency tree, to the structure-aware chunk it emits.

Input. A 5-page physics paper. Fig. 1 appears on page 2; the surrounding section content (title, intro, body paragraphs, caption) is scattered across pages 1–3. Naive chunkers slice this stack by tokens or sentence boundaries — losing the figure–caption binding the first time a page break gets in the way.

The 5-page input document with page 2 highlighted

Dependency tree. M3DocDep treats every layout block (title, section-title, text, figure, figure-caption) as a node and learns parent–child edges across the whole document — including cross-page links. The orange Key Path highlights the dependency chain this chunk follows: ROOT → 1:title → 17:section-title → 19:figure → 20:figure-caption.

Recovered dependency tree with Key Path highlighted

Output. The structure-aware chunk M3DocDep emits. It records the section path (# Title › ## 1. Intro), the page range spanned (2), the figure block with its bound caption, and the explicit dependency edge (19:figure → 20:caption). This is the retrieval unit used at RAG time — the reader sees the figure with its caption and its enclosing section all in one shot.

The final structure-aware chunk card emitted by M3DocDep

Figures reproduced from the paper's supplementary qualitative example.

Chunks per baseline

Reproduced from the supplementary's table of chunking comparisons: for the same Kerr black hole input, each method produces its own 3 chunks. The entries below contrast a length-based baseline with M3DocDep's structure-aware chunks.

Length chunking

Fixed-size token windows.

M3DocDep (ours)

Tree-guided structure-aware chunking

Global dependency tree → section subtree chunks with section_path and page metadata.

Chunk texts are exactly as reported in the supplementary's chunking-examples table (tab:chunking_examples); truncations are marked with …
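For contrast, the length-chunking baseline listed above (fixed-size token windows) can be sketched in a few lines. Whitespace tokenization, the window size of 8, and the sample text are assumptions for illustration; the text is invented and is not taken from the supplementary.

```python
def length_chunks(text, window=8):
    """Split whitespace tokens into consecutive fixed-size windows."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + window])
        for i in range(0, len(tokens), window)
    ]


# Invented text: a body sentence followed directly by a figure caption.
text = (
    "Section one body text that ends here. "
    "Fig. 1. Caption sentence that describes the plotted curves in detail."
)
windows = length_chunks(text, window=8)
for w in windows:
    print(repr(w))
```

The window boundary falls wherever the token count dictates, so in this sample the caption marker ends up in one chunk and the caption body in the next. This is the kind of split that tree-guided chunking is meant to avoid.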

Results

Evaluated on four corpora — DUDE, MP-DocVQA, CUAD, MOAMOB — covering financial reports, contracts, scanned forms, and complex layouts.

Document Hierarchical Parsing (DHP) — F1 / STEDS (%)

Method	HRDS	HRDH	DocHieNet
Qwen2.5-VL (LVLM only)	28.4 / 14.5	27.6 / 20.0	18.2 / 9.6
DSPS	65.3 / 59.6	54.1 / 38.4	35.6 / 23.8
DSHP-LLM	44.9 / 29.5	61.3 / 51.3	64.3 / 53.5
M3DocDep (ours)	82.9 / 76.5	77.8 / 71.7	76.0 / 70.8

~5× STEDS over a plain LVLM. Structural constraints (header-centric priors, MST decoding, cross-page edges) matter.
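One way to read the ~5× figure is as the ratio of mean STEDS across the three datasets. The snippet below computes that ratio and the per-dataset ratios from the table values; the choice of aggregation is an assumption, since the page does not say how the figure was derived.

```python
# STEDS (%) values copied from the DHP table: plain LVLM vs. M3DocDep.
steds = {
    "HRDS": {"lvlm": 14.5, "ours": 76.5},
    "HRDH": {"lvlm": 20.0, "ours": 71.7},
    "DocHieNet": {"lvlm": 9.6, "ours": 70.8},
}

per_dataset = {name: v["ours"] / v["lvlm"] for name, v in steds.items()}

mean_ours = sum(v["ours"] for v in steds.values()) / len(steds)
mean_lvlm = sum(v["lvlm"] for v in steds.values()) / len(steds)
ratio_of_means = mean_ours / mean_lvlm

for name, r in per_dataset.items():
    print(f"{name}: {r:.2f}x")
print(f"ratio of means: {ratio_of_means:.2f}x")
```

Under this reading the ratio of means comes out just under 5, with per-dataset ratios ranging from roughly 3.6 on HRDH to roughly 7.4 on DocHieNet.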

Downstream QA — ANLS (%)

Dataset	Ours	Δ vs. MultiDocFusion (points)
DUDE	21.4	+2.84
MP-DocVQA	18.2	+2.02
CUAD	29.3	+1.87
MOAMOB	27.1	+1.18

Averaged over LLaVA-OneVision-1.5, InternVL-3.5, Qwen2.5-VL readers.
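The Δ column appears to be in absolute ANLS points, while the headline gains are relative percentages. Assuming the baseline score is Ours minus Δ (an inference from the table, not something the page states), the relative gains can be recomputed as follows.

```python
# (ours, delta) pairs copied from the table; delta is assumed to be in ANLS points.
anls = {
    "DUDE": (21.4, 2.84),
    "MP-DocVQA": (18.2, 2.02),
    "CUAD": (29.3, 1.87),
    "MOAMOB": (27.1, 1.18),
}


def relative_gain(ours, delta):
    """Percent gain over the baseline, taking baseline = ours - delta."""
    baseline = ours - delta
    return 100.0 * delta / baseline


gains = {name: relative_gain(o, d) for name, (o, d) in anls.items()}
average_gain = sum(gains.values()) / len(gains)

for name, g in gains.items():
    print(f"{name}: {g:.1f}%")
print(f"average: {average_gain:.1f}%")
```

Under these assumptions the average lands near 9.8% and the largest gain near 15.3%, in line with the headline figures. Individual values can differ in the last digit from the reported range because the table entries are rounded.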

Ablation — STEDS drop

No cross-page edges −9.26

MST → local argmax −6.70

No header-centric prior −2.55

SoftROI → uniform pooling −1.70

Avg STEDS drop across 3 datasets (HRDS · HRDH · DocHieNet).

Robust to component swaps

M3DocDep ranks first in nDCG under every individual pipeline swap: 3 DP backbones (DETR, DiT, VGT), 3 OCR engines (Tesseract, EasyOCR, TrOCR), and 4 retrieval/embedding models (BGE, E5, BM25, MM-Embed). Backbone-agnostic by design.

BibTeX

@InProceedings{Shin_2026_CVPR,
    author    = {Shin, Joongmin and Park, Jeongbae and Seo, Jaehyung and Lim, Heuiseok},
    title     = {M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16603-16613}
}