Large Vision-Language Model (LVLM) High-Resolution Image Processing Research Roadmap & Literature Library
💡 Tip: If you find this repository's structure or content difficult to understand, visit deepwiki for a comprehensive, detailed explanation.
The evolution of Large Vision-Language Models (LVLMs / VLMs) follows one core thread: how to make models see more clearly. This is essentially a journey of breaking through the input limitations of early vision encoders (typically ViTs pre-trained at 224×224 or 336×336 resolution). We divide this history into six distinct phases, culminating in the native-multimodal and ultra-long-context explosion of 2025.
Early VLMs faced severe loss of local details (e.g., document text, small objects). Researchers began treating large images like puzzles, forcing them into fixed-size tiles.
- Representative Works: Monkey used sliding windows to slice images into 448x448 tiles; LLaVA-UHD introduced adaptive slicing to preserve original image shape.
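The slice-and-stitch idea is simple enough to sketch. Below is an illustrative Python sketch (not Monkey's or LLaVA-UHD's actual preprocessing code), assuming a fixed 448×448 tile size and zero-padding up to a full grid:

```python
import numpy as np

TILE = 448  # tile edge used by Monkey-style slicing (per the text above)

def slice_into_tiles(img: np.ndarray, tile: int = TILE):
    """Pad an (H, W, C) image up to a multiple of `tile`, then cut it
    into a grid of tile x tile crops (row-major order)."""
    h, w, _ = img.shape
    ph, pw = -h % tile, -w % tile          # padding needed on each axis
    padded = np.pad(img, ((0, ph), (0, pw), (0, 0)))
    rows, cols = padded.shape[0] // tile, padded.shape[1] // tile
    tiles = [padded[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(rows) for c in range(cols)]
    return tiles, (rows, cols)

# A 1000x1700 "image" pads to 1344x1792 and becomes a 3x4 grid of tiles.
tiles, grid = slice_into_tiles(np.zeros((1000, 1700, 3), dtype=np.uint8))
```

Real pipelines typically also append a downsampled global thumbnail, so the model keeps a whole-image view alongside the high-resolution tiles.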
Moving away from fixed slicing, models started dynamically allocating different numbers of tokens based on the original image size and aspect ratio. This phase saw model resolutions hitting true 4K levels.
- Representative Works: InternLM-XComposer2-4KHD pioneered dynamic layout support for 4K HD (3840x1600); InternVL 1.5 used brute-force slicing into up to 40 tiles (approaching 4K).
As slice numbers increased massively (a 4K image could generate tens of thousands of tokens), LLM inference costs exploded quadratically. Researchers had to find "lazy" optimization strategies.
- Representative Works: SliME (Beyond LLaVA-HD) used Mixture of Experts (MoE) and compression; DeepStack proposed the brilliant "layer stacking" idea, inputting visual tokens into different LLM layers in parallel without increasing sequence length.
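DeepStack's layer-stacking trick can be illustrated with a toy forward pass: extra high-resolution visual tokens are added as residuals to the hidden states of successive layers instead of being concatenated to the input sequence. A schematic numpy sketch, not the paper's implementation (`dummy_layer` is a stand-in for a transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, n_layers = 16, 32, 4

# Base sequence (e.g., text + low-res visual tokens).
hidden = rng.normal(size=(seq_len, dim))
# One group of extra high-res visual tokens per layer, aligned to the same positions.
extra = [rng.normal(size=(seq_len, dim)) for _ in range(n_layers)]

def dummy_layer(x):
    # Stand-in for a transformer block; a real model would attend here.
    return np.tanh(x)

for layer_idx in range(n_layers):
    hidden = hidden + extra[layer_idx]  # inject this layer's visual tokens as a residual
    hidden = dummy_layer(hidden)
# The sequence length never grew, so attention cost stays unchanged.
```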
Researchers began rethinking the "patching" approach, attempting to enable models to handle arbitrary resolutions natively from the ground up via position encoding or encoder architecture.
- Representative Works: Qwen2-VL proposed Naive Dynamic Resolution combined with M-RoPE to handle variable-length sequences; Pixtral 12B chose to train a vision encoder from scratch with RoPE-2D.
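The RoPE-2D idea behind both works can be sketched as follows: half of a patch embedding's feature pairs are rotated by a phase derived from the row index and the other half by the column index, so any (row, col) grid position is encoded without learned absolute embeddings. An illustrative sketch, not either model's actual code:

```python
import numpy as np

def rope_2d(x, row, col, base=10000.0):
    """Apply a toy 2D rotary position encoding to one patch embedding `x`
    (1-D, length divisible by 4): the first half of the feature pairs
    encodes the row index, the second half the column index."""
    d = x.shape[0]
    half = d // 2                                  # dims devoted to each axis
    freqs = base ** (-np.arange(0, half, 2) / half)

    def rotate(vec, pos):
        out = vec.copy()
        angles = pos * freqs
        cos, sin = np.cos(angles), np.sin(angles)
        out[0::2] = vec[0::2] * cos - vec[1::2] * sin
        out[1::2] = vec[0::2] * sin + vec[1::2] * cos
        return out

    return np.concatenate([rotate(x[:half], row), rotate(x[half:], col)])

x = np.ones(8)
a = rope_2d(x, row=2, col=5)
b = rope_2d(x, row=5, col=2)   # swapping row/col gives a different encoding
```

Because each pair is only rotated, the embedding norm is preserved while relative offsets in both axes become recoverable from inner products.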
In early 2025, research entered deep waters. One focus was cutting the expensive cost of high-resolution pre-training; the other was the "OneVision" philosophy: a single architecture handling single-image, multi-image, and long-video inputs, all with dynamic high-resolution support.
- Representative Works: PS3 (CVPR 2025) reduced 4K pre-training costs by 79x; MiniCPM-o 2.6 achieved astonishing 1.8M pixel support on edge devices; Qwen2.5-VL further enhanced Naive Dynamic Resolution.
With the release of Qwen3-VL and InternVL 3.5, models completely broke down the boundary between "image" and "video", supporting million-token context windows (1M) and native streaming input. High resolution is no longer a bottleneck; it is unified with long-video understanding.
- Representative Works: Qwen3-VL (256K native context, 1M extended), InternVL 3.5 (241B, native multimodal pre-training), MiniCPM-o 4.5 (Full-duplex streaming multimodal).
| Time | Phase | Key Tech / Event | Representative Models |
|---|---|---|---|
| 2023 Late | Early Exploration | Resampling & Position-Aware Adapter | Qwen-VL, Monkey |
| 2024 H1 | Dynamic Slicing | AnyRes Grid & Adaptive Slicing become mainstream | LLaVA-NeXT, InternVL 1.5 |
| 2024 H1 | 4K Breakthrough | Pioneering 4K Resolution & Dynamic Layout | InternLM-XComposer2-4KHD |
| 2024 H2 | Efficiency Optimization | Token Compression, MoE Routing & Layer Stacking | SliME, DeepStack |
| 2024 H2 | Native Perception | 3D-RoPE / 2D-RoPE fully support variable context | Qwen2-VL, Pixtral |
| 2025 H1 | Unified Omni | M-RoPE Enhanced, Edge 1.8M Pixels, Low-cost 4K Pre-training | Qwen2.5-VL, MiniCPM-o 2.6, PS3 |
| 2025 Mid | Linear & Streaming | Linear Sequence Modeling, Zero-Padding Streaming, Foveated Vision | V-Mamba-XL, Qwen2.5-VL-Flash, FocusLLaVA |
| 2025 Late | Native Agent | Native Sequence Ext., Semantic Compression, Vision Agent Workflow | Qwen3-VL, DeepSeek-VL2, Gemini 1.5-002 |
| 2026 Early | Cognitive Reform | Visual Causal Flow, High-Res Microscopy, Multimodal Quantized MoE | DeepSeek-OCR 2, RAViT, MuViT |
```mermaid
graph TB
    %% Style Definitions: rounded corners, soft colors, larger font
    classDef low fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;
    classDef slice fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;
    classDef dynamic fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;
    classDef efficient fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;
    classDef native fill:#ffccbc,stroke:#d84315,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;
    classDef future fill:#e0f7fa,stroke:#006064,stroke-width:2px,rx:10,ry:10,color:#000,font-size:14px;

    %% Link Styles
    linkStyle default stroke:#666,stroke-width:2px;

    NodeStart((Early<br>Fixed Res)):::low

    subgraph S1 [2023 Slice & Stitch]
        direction TB
        Monkey:::slice
        LLaVA_UHD[LLaVA-UHD]:::slice
        DocOwl[mPLUG-DocOwl 1.5]:::slice
    end

    subgraph S2 [2024 H1 Dynamic Res]
        direction TB
        QwenVL[Qwen-VL]:::dynamic
        LLaVANext[LLaVA-NeXT]:::dynamic
        InternM[InternLM-XC2]:::dynamic
        InternVL[InternVL 1.5]:::dynamic
    end

    subgraph S3 [2024 H2 Token Efficiency]
        direction TB
        SliME[SliME]:::efficient
        DeepStack:::efficient
    end

    subgraph S4 [2024 H2 Native Perception]
        direction TB
        Qwen2_VL[Qwen2-VL]:::native
        Pixtral[Pixtral 12B]:::native
        InternVL2_5[InternVL 2.5]:::native
    end

    subgraph S5 [2025 Omni & Streaming]
        direction TB
        PS3["PS3"]:::future
        MiniCPM["MiniCPM-o 2.6"]:::future
        VMamba["V-Mamba-XL"]:::future
        Focus["FocusLLaVA"]:::future
        Qwen2_5_Flash["Qwen2.5-VL-Flash"]:::future
        Gemini[Gemini 1.5 Pro-002]:::future
    end

    subgraph S6 [2026 Native Agent & Cognition]
        direction TB
        Qwen3_VL["Qwen3-VL"]:::future
        DeepSeek["DeepSeek-OCR 2"]:::future
        RAViT["RAViT"]:::future
    end

    %% Core Evolution Path
    NodeStart --> Monkey & LLaVA_UHD & QwenVL
    Monkey --> InternM & InternVL
    QwenVL --> LLaVANext
    LLaVA_UHD --> DocOwl

    %% Phase Transitions
    InternM -.-> Qwen2_VL
    InternVL --> InternVL2_5
    QwenVL --> Qwen2_VL
    SliME --> DeepStack

    %% Modern Architectures
    Qwen2_VL & Pixtral --> PS3
    InternVL2_5 --> MiniCPM
    Qwen2_VL --> VMamba --> Focus
    Focus --> Qwen2_5_Flash --> Qwen3_VL
    DeepStack -.-> DeepSeek
    PS3 & Pixtral --> RAViT
    Gemini -.-> Qwen3_VL
```
(Note: Data based on latest public records as of March 2026)
| Model Name | Release Date | Resolution Strategy | Max Resolution | Core Innovation |
|---|---|---|---|---|
| RAViT / MuViT | 2026.02 | Multi-Resolution | Gigapixel (Micro) | CVPR 2026 work, Adaptive Transformer for ultra-high res microscopy/panorama |
| DeepSeek-OCR 2 | 2026.01 | Visual Causal Flow | Arbitrary | Visual causal flow mechanism, breaking traditional slicing logic, enhancing reasoning coherence |
| DeepSeek-VL2 | 2025.12 | MoE + Global | 4K+ (OCR) | Mixture-of-Experts architecture optimized for OCR and high-res documents |
| Qwen3-VL | 2025.11 | Interleaved-MRoPE | 4K+ / 1M Context | Full-band M-RoPE, native 256K context supporting ultra-long video |
| TokenPacker | 2025.10 | Semantic Compression | 4K (Compressed) | Semantic clustering-based on-the-fly compression, reducing 4K image tokens by 75% |
| Gemini 1.5 Pro-002 | 2025.09 | Native Linear | 8K+ / 2M Context | Linear vision attention mechanism, natively supporting ultra-long video streams |
| Qwen2.5-VL-Flash | 2025.08 | Zero-Padding Streaming | Arbitrary | 2D-RoPE streaming encoder, zero padding for arbitrary aspect ratios |
| FocusLLaVA | 2025.06 | Dynamic Foveation | 8K (Foveated) | Dynamic foveation mechanism, high-res encoding only for high-density areas |
| Scale-Any | 2025.05 | Inference Adaptation | 1344px (Zero-shot) | Training-free inference-time position interpolation for low-res models |
| Fluid-Token | 2025.04 | Entropy Sampling | Dynamic | Entropy-guided sampling, dynamically allocating tokens based on information density |
| V-Mamba-XL | 2025.03 | SSM (Mamba) | 4K (Linear) | Selective State Space Model replacing Attention for linear complexity 4K inference |
| Qwen2.5-VL | 2025.02 | Naive Dynamic+ | Arbitrary | Enhanced dynamic resolution, better alignment with human preference |
| MiniCPM-o 2.6 | 2025.01 | Tile + Efficient | 1.8M Pixels | High-efficiency on edge, unified architecture for single/multi-image/video |
| PS3 | 2025.01 | Patch Selection | 4K (Pre-train) | Local contrastive learning, reducing 4K pre-training costs by 79x |
| InternVL 2.5 | 2024.12 | Dynamic + MPO | 4K+ | MPO preference optimization, enhancing dynamic resolution robustness |
| Pixtral 12B | 2024.10 | RoPE-2D | Arbitrary (Native) | Native Vision Encoder trained from scratch supporting arbitrary aspect ratios |
| Qwen2-VL | 2024.09 | Naive Dynamic | Arbitrary (Native) | M-RoPE rotary position encoding, treating images as variable-length token streams |
| DeepStack | 2024.06 | Layer Stacking | 4K+ | Stacking visual tokens into different layers, not occupying sequence length |
| SliME | 2024.06 | MoE + Global | Arbitrary | Local/Global Token MoE routing for cost efficiency |
| InternLM-XC2-4KHD | 2024.04 | Dynamic | 4K (3840×1600) | Pioneering 4K dynamic layout support |
| InternVL 1.5 | 2024.04 | Dynamic Tile | 4K (40 tiles) | Strong vision backbone (InternViT-6B), brute-force slicing |
| LLaVA-UHD | 2024.03 | Adaptive Slice | Arbitrary Ratio | Adaptive slicing + compression layer to avoid shape distortion |
| Monkey | 2023.11 | Sliding Window | 1344×896 | Multi-way LoRA processing for different slice positions |
- Model: RAViT
- Date: 2026.02 (arXiv / CVPR 2026)
- Innovation: Proposed a resolution-adaptive Transformer that dynamically adjusts computation based on input image complexity non-intrusively, without complex preprocessing slicing.
- Link: Paper
- Model: MuViT
- Date: 2026.02 (CVPR 2026)
- Innovation: Ultra-high resolution processing solution for gigapixel microscopy images, demonstrating scalability of native Transformer architectures at extreme resolutions.
- Link: Paper
- Model: DeepSeek-OCR 2
- Date: 2026.01
- Innovation: Introduced "Visual Causal Flow" mechanism. Instead of simple static slicing, it mimics the dynamic causal process of human reading and scanning, solving logical coherence issues in ultra-high-resolution documents.
- Link: Code | HuggingFace
- Model: DeepSeek-VL2
- Date: 2025.12
- Innovation: Adopted Mixture-of-Experts (MoE) architecture specifically optimized for vision-language tasks, especially for high-density document processing efficiency.
- Link: Paper | Code
- Model: Qwen3-VL
- Date: 2025.11
- Innovation: Towards Native Multimodal Agent. Qwen3-VL supports 4K+ and 1M context, optimized for GUI operations and complex visual tasks, using Interleaved-MRoPE for full-band position awareness.
- Link: Paper | Code
- Model: TokenPacker
- Date: 2025.10 (ICCV 2025)
- Innovation: Addressing excessive tokens from high-res images, proposed an On-the-fly Compression algorithm based on semantic clustering. Retains tokens only in texture-rich areas, compressing effective tokens of a 4K image to 1/4.
- Link: Paper
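The clustering idea can be illustrated with a toy compressor that runs k-means over token embeddings and keeps one centroid per cluster; this is a hypothetical sketch of the general technique (including the function name), not the paper's algorithm:

```python
import numpy as np

def compress_tokens(tokens, ratio=4, iters=10, seed=0):
    """Toy semantic-clustering compressor: k-means over token embeddings,
    keeping one centroid per cluster, so n_tokens / ratio tokens survive."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    k = max(1, n // ratio)
    centroids = tokens[rng.choice(n, size=k, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        # Assign each token to its nearest centroid, then recompute means.
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = tokens[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

tokens = np.random.default_rng(1).normal(size=(64, 16))
packed = compress_tokens(tokens)   # 64 tokens compressed to 16
```

A 4:1 ratio matches the 75% reduction claimed in the entry above; redundant background tokens collapse into shared centroids while distinctive tokens keep their own clusters.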
- Model: Gemini 1.5 Pro-002
- Date: 2025.09
- Innovation: Introduced Linear Vision Attention mechanism specifically optimized for visual modality, completely solving KV Cache memory explosion, natively supporting ultra-long video streams (10M+ context).
- Link: Blog
- Model: Qwen2.5-VL-Flash
- Date: 2025.08
- Innovation: Realized true "Zero-Padding" for arbitrary aspect ratio images. Uses 2D-RoPE streaming encoder, allowing images to be input at original resolution and ratio, eliminating semantic loss at slice edges.
- Link: Blog
- Model: FocusLLaVA
- Date: 2025.06 (CVPR 2025)
- Innovation: Proposed Dynamic Foveation mechanism, performing high-resolution encoding only on high-information-density regions, downsampling the background, significantly improving inference speed.
- Link: Paper
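Foveation of this kind can be approximated with a simple saliency proxy. The sketch below scores tiles by pixel variance (a crude stand-in for a learned selector, not the paper's method) and keeps the top quarter for high-resolution encoding:

```python
import numpy as np

def foveate(img, tile=32, keep_frac=0.25):
    """Toy foveation: score each tile by pixel variance and return a boolean
    mask marking the top `keep_frac` tiles for high-resolution encoding;
    the rest would be downsampled."""
    h, w = img.shape[:2]
    rows, cols = h // tile, w // tile
    scores = np.array([[img[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile].var()
                        for c in range(cols)] for r in range(rows)])
    k = max(1, int(rows * cols * keep_frac))
    flat = scores.ravel()
    mask = np.zeros_like(flat, dtype=bool)
    mask[np.argsort(flat)[-k:]] = True    # strictly top-k tiles
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
img = np.zeros((128, 128))
img[:32, :32] = rng.normal(size=(32, 32))   # only the top-left tile has texture
mask = foveate(img)                          # that tile is guaranteed to be kept
```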
- Model: Scale-Any
- Date: 2025.05 (arXiv)
- Innovation: Training-free plugin module that adjusts position encoding interpolation during inference, enabling low-resolution models to "understand" high-resolution inputs.
- Link: Paper
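Training-free position interpolation is a well-known trick from LLM context extension; the sketch below shows the linear variant, where inference-time patch positions are squeezed back into the range the encoder saw during training (whether this entry's method uses exactly linear scaling is an assumption):

```python
import numpy as np

def interpolate_positions(n_patches, train_len):
    """Linear position interpolation: map `n_patches` inference-time positions
    into the [0, train_len) range the position encoding was trained on."""
    scale = min(1.0, train_len / n_patches)   # no-op when input fits already
    return np.arange(n_patches) * scale

# E.g., a 1344px input (48x48 = 2304 patches) on a model trained with 576 positions.
pos = interpolate_positions(n_patches=2304, train_len=576)
```

Because the scaled positions stay in-distribution for the (rotary or absolute) position encoding, no weights need retraining; only per-position resolution of detail is diluted.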
- Model: Fluid-Token
- Date: 2025.04 (ICLR 2025 Oral)
- Innovation: Introduced entropy-guided sampler, dynamically allocating tokens based on image region information densityโmore for complex areas, fewer for simple backgrounds.
- Link: OpenReview
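The entropy-guided allocation can be sketched with histogram entropy as the density measure; `allocate_tokens` is an illustrative name for the general technique, not the paper's API:

```python
import numpy as np

def entropy(patch, bins=16):
    """Shannon entropy (bits) of a patch's intensity histogram, values in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def allocate_tokens(patches, budget=64):
    """Split a total token budget across patches proportionally to entropy,
    guaranteeing at least one token per patch."""
    ents = np.array([entropy(p) for p in patches])
    if ents.sum() == 0:
        return np.full(len(patches), budget // len(patches))
    share = ents / ents.sum()
    return np.maximum(1, np.round(share * budget)).astype(int)

rng = np.random.default_rng(0)
flat = np.full((32, 32), 0.5)     # uniform patch: zero entropy
busy = rng.random((32, 32))       # noisy patch: high entropy
alloc = allocate_tokens([flat, busy, flat])   # budget concentrates on `busy`
```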
- Model: V-Mamba-XL
- Date: 2025.03 (CVPR 2025)
- Innovation: Replaced ViT's Attention with Selective State Space Model (SSM), achieving linear complexity processing for 4K resolution images.
- Link: Paper
- Model: Qwen2.5-VL
- Date: 2025.02
- Innovation: This version further optimized Naive Dynamic Resolution, making understanding of different aspect ratios more aligned with human intuition, significantly improving OCR capabilities.
- Link: Paper | Code
- Model: PS3
- Date: 2025.01 (CVPR 2025)
- Innovation: Proposed Top-down Patch Selection, selecting only key regions for contrastive learning, reducing computation by 79x.
- Link: Paper
- Model: MiniCPM-o 2.6
- Date: 2025.01
- Innovation: Strongest on edge. Supports real-time streaming multimodal interaction, maintaining 1.8 Million Pixels high-res processing capability while significantly reducing edge inference latency.
- Link: Code
- Model: InternVL 2.5
- Date: 2024.12
- Innovation: Introduced MPO (Mixed Preference Optimization), further enhancing robustness of dynamic resolution.
- Link: Code
- Model: Pixtral 12B
- Date: 2024.10
- Innovation: Completely abandoned CLIP, training from scratch a Vision Encoder supporting arbitrary aspect ratios, using RoPE-2D instead of absolute position encoding.
- Link: Code
- Model: Qwen2-VL
- Date: 2024.09
- Innovation: Game changer. Proposed Naive Dynamic Resolution โ treating images as a variable-length stream of tokens.
- Link: Code
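The budgeting behind dynamic resolution can be sketched in a few lines: round each side to a multiple of patch_size × merge_size (28 here, matching Qwen2-VL's published 14-pixel patches with 2×2 token merging), then shrink uniformly if the token count exceeds a budget. A simplified sketch, not the official `smart_resize`:

```python
import math

def token_count(height, width, patch=14, merge=2, max_tokens=1024):
    """Sketch of dynamic-resolution budgeting: round each side to a multiple
    of patch*merge, then downscale uniformly if over the token budget."""
    unit = patch * merge
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    tokens = (h // unit) * (w // unit)
    if tokens > max_tokens:
        shrink = math.sqrt(max_tokens / tokens)   # preserve aspect ratio
        h = max(unit, math.floor(height * shrink / unit) * unit)
        w = max(unit, math.floor(width * shrink / unit) * unit)
        tokens = (h // unit) * (w // unit)
    return (h, w), tokens

(h, w), n = token_count(1080, 1920)   # a 1080p frame fits the budget after shrinking
```

Treating every image this way yields a variable-length token stream whose length tracks the input's native resolution, instead of a fixed grid.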
- Model: SliME & DeepStack
- Innovation: MoE routing & layer-stacking optimization.
- Model: InternLM-XComposer2-4KHD
- Innovation: Pioneered 4K dynamic layout support.
- Model: LLaVA-NeXT
- Innovation: Popularized the AnyRes slicing paradigm.
- Model: Monkey & LLaVA-UHD
- Innovation: Pioneering works in high-resolution VLMs.
This project aims to maintain the most cutting-edge and comprehensive roadmap of High-Resolution VLM technology. Due to the rapid development of the field (especially between 2025-2026), omissions are inevitable.
We welcome community contributions:
- Submit Issues: Report missing papers, model updates, or errata.
- Submit PRs: Add new paper entries or optimize the comparison tables.
- Discussions: Share your views on the future of "Gigapixel Vision" or "Native Multimodal" in Issues.
💡 Tip: When submitting new papers, please follow the existing format: Title + Date + Core Innovation + Link.
Content in this repository is licensed under the MIT License. Please cite the source if used.