LLaVA-OneVision-1.5 Goes Fully Open Source: 8B Model Pre-training Completed in Just 4 Days for $16,000

Deep News
Oct 13, 2025

LLaVA was introduced in 2023, connecting open-source vision encoders to large language models through low-cost alignment and bringing "view an image, understand it, discuss it" multimodal capabilities to the open ecosystem. This significantly narrowed the gap with top-tier closed-source models, marking an important milestone for the open-source multimodal paradigm.

The LLaVA series has evolved systematically: LLaVA started by connecting a vision encoder to a large language model through low-cost alignment; LLaVA-1.5 improved understanding with larger, cleaner data and high-resolution input; and LLaVA-NeXT expanded to OCR, mathematical reasoning, and multi-scenario tasks. The line then branched into LLaVA-NeXT-Video for temporal video processing and multi-frame inference, and LLaVA-NeXT-Interleave for interleaved image-text inputs and cross-image joint reasoning. Finally, LLaVA-OneVision converged on a unified interface covering images, documents, charts, multi-image inputs, and videos, balancing effectiveness and efficiency.

While multimodal alignment interfaces and architectures are converging, truly "reproducible" open-source paths still differ from "weight-only releases." Models like Qwen2.5-VL and InternVL3.5 have set high baselines in OCR, document understanding, mathematics, and cross-image reasoning, but complete data inventories, cleaning and mixing ratios, and alignment/sampling and training schedules are often only partially disclosed, making end-to-end reproduction difficult.

The Inspiration Laboratory team, in collaboration with LMMs-Lab, has launched LLaVA-OneVision-1.5 around three goals: high performance, low cost, and strong reproducibility. They provide a fully open, concept-balanced 85M pre-training dataset (LLaVA-OV-1.5-Mid-Training-85M) and a carefully curated 22M instruction dataset (LLaVA-OV-1.5-Instruct-22M), trained with a compact three-stage process: language-image alignment (Stage-1), concept balancing and high-quality knowledge injection (Stage-1.5), and instruction fine-tuning (Stage-2).

Combined with offline parallel data packing (up to roughly 11× padding compression) and Megatron-LM with a distributed optimizer, Stage-1.5 pre-training of the 8B-scale VL model completes in about 4 days on 128 A800 GPUs, at a total cost of roughly $16,000.

**Key Technical Innovations:**

**Data Construction Highlights:** The 85M pre-training dataset integrates 8 heterogeneous sources (COYO-700M, Obelics, DataComp-1B, LAION-CN, ImageNet-21K, SAM-1B, MINT, and Zero250M), yielding approximately 20 million Chinese and 65 million English image-text pairs. To address long-tail concept sparsity and noisy or missing original captions, the team employs a feature-driven "concept balancing" strategy: a MetaCLIP encoder maps all images and the embeddings of a 500K-entry concept vocabulary into a shared vector space.
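The exact balancing pipeline is specified in the technical report; as a rough illustration of the inverse-frequency idea behind concept balancing, here is a minimal NumPy sketch using random stand-ins for the MetaCLIP image features and concept embeddings (all names, sizes, and the nearest-concept assignment rule are illustrative assumptions, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for MetaCLIP features: 1,000 image embeddings and a
# 50-entry concept vocabulary, L2-normalized into a shared space.
img_emb = rng.normal(size=(1000, 64))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
concept_emb = rng.normal(size=(50, 64))
concept_emb /= np.linalg.norm(concept_emb, axis=1, keepdims=True)

# Assign each image to its nearest concept by cosine similarity.
nearest = (img_emb @ concept_emb.T).argmax(axis=1)

# Inverse-frequency sampling weights: images of rare concepts are
# drawn more often, flattening the long-tailed concept distribution.
counts = np.bincount(nearest, minlength=len(concept_emb))
weights = 1.0 / counts[nearest]
weights /= weights.sum()

# Resample the corpus according to the balanced weights.
balanced = rng.choice(len(img_emb), size=len(img_emb), replace=True, p=weights)
```

After normalization, every concept that appears in the corpus receives equal total sampling mass, which is the balancing effect the strategy aims for.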

The 22M instruction data covers eight categories: Caption, Chart & Table, Code & Math, Domain-specific, General VQA, Grounding & Counting, OCR, and Science.

**Training Strategy:** 1. **Visual Encoder Pre-training:** Uses the in-house MVT v1.5 (RICE-ViT) as the visual backbone, which introduces a unified Region Cluster Discrimination mechanism trained on 450 million images and 2.4 billion candidate regions.

2. **Three-stage Learning Process:**
   - Stage-1: Language-image alignment using the LLaVA-1.5 558K dataset
   - Stage-1.5: High-quality knowledge mid-term pre-training on the concept-balanced 85M data
   - Stage-2: Visual instruction alignment based on the 22M instruction data

3. **Offline Parallel Data Packing:** Achieves up to roughly 11× padding compression by clustering samples by token length and concatenating short samples, in parallel across threads, into sequences that approach the target length.
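The actual packing toolchain ships with the release; its core idea can be sketched as a toy first-fit-decreasing packer (a common bin-packing heuristic used here as an assumption — the real pipeline additionally clusters by length and runs multi-threaded, and the 11× figure depends on the real length distribution):

```python
import random

def pack_samples(lengths, target_len):
    """Greedy first-fit-decreasing packing: visit samples from longest
    to shortest and place each into the first sequence ("bin") with
    enough free tokens, opening a new bin when none fits.
    Returns a list of bins, each a list of sample indices."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, free = [], []  # free[b] = unused tokens left in bins[b]
    for i in order:
        for b in range(len(bins)):
            if lengths[i] <= free[b]:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i])
            free.append(target_len - lengths[i])
    return bins

random.seed(0)
lengths = [random.randint(50, 900) for _ in range(200)]
bins = pack_samples(lengths, target_len=1024)

# Padding compression: one sample per 1024-token sequence vs packed bins.
print(f"compression: {len(lengths) / len(bins):.1f}x")
```

Each packed sequence holds several short samples, so far fewer padded sequences are needed than with one sample per sequence.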

4. **Mixed Parallel and Long Context:** Employs tensor parallelism (TP), pipeline parallelism (PP), and sequence/context parallelism, coordinated with a distributed optimizer.
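In Megatron-LM terms, such a parallelism mix corresponds to launch flags along these lines (the degree values and script name below are placeholders, not the released training configuration):

```shell
# Illustrative Megatron-LM parallelism flags; values are hypothetical.
torchrun --nproc_per_node 8 pretrain_vlm.py \
  --tensor-model-parallel-size 4 \
  --sequence-parallel \
  --pipeline-model-parallel-size 2 \
  --context-parallel-size 2 \
  --use-distributed-optimizer
```

Tensor parallelism splits each layer across GPUs, pipeline parallelism splits the layer stack, context parallelism splits long sequences, and the distributed optimizer shards optimizer state across data-parallel ranks.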

**Results and Impact:** Experimental results show that LLaVA-OneVision-1.5 is competitive with or superior to Qwen2.5-VL across multiple public multimodal benchmarks. The framework is fully transparent: data, training and packing toolchains, configuration scripts, logs, and reproducible evaluation commands are all released.

**Resources:**

- Paper: "LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training"
- Code: https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
- Technical Report: https://arxiv.org/abs/2509.23661
- Data/Models: https://huggingface.co/collections/lmms-lab/llava-onevision-15-68d385fe73b50bd22de23713
- Demo: https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5

LLaVA-OneVision-1.5 demonstrates that with concept-balanced 85M pre-training data, high-quality instruction data, the fine-grained RICE-ViT visual foundation, and a compact three-stage strategy, 8B-scale models can match or partially exceed mainstream open-source and some closed-source multimodal models at lower token and compute costs. The team emphasizes that reproduction is straightforward: data, toolchains, scripts, configurations, logs, and evaluation recipes are all available, with clear reproduction paths and well-defined dependencies.

