Model Highlights
Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. 🧠✨ Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model’s representation power while deepening the semantic alignment between visual and language modalities—unlocking unprecedented capabilities in nuanced visual-textual reasoning. 📊
The model leverages cutting-edge multimodal reinforcement learning on verifiable tasks, integrating the GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling for exceptional learning efficiency. ⚡ Responding to strong community demand, we’ve significantly strengthened the model’s grounding performance and instruction following, making visual grounding functions more accessible than ever. 🎯 Additionally, our innovative “Thinking with Images” feature, when paired with tools like image zooming and image search, dramatically elevates the model’s ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️
Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what’s possible in visual-language understanding. 🤖🌟

Key Capabilities
As a lightweight model that activates only 3B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry’s top flagship models across various benchmarks. 🚀
- Visual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨
- STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡
- Visual Grounding 📍🎨: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost! ⚙️💪
- Thinking with Images 🤔🔍: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information. 🖼️✨
- Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚
- Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟
Showcase
Visual Reasoning
Case: Analyzing a Peak-Time Chart to Identify Optimal Visiting Hours
In this scenario, the model receives an image showing a “Peak Time Reminder” chart that visualizes customer traffic intensity across different time slots during the week.
The user asks the model to determine the optimal visiting periods between November 8 and 12, 2025, avoiding high-traffic hours and business peak days.
ERNIE-4.5-VL-28B-A3B-Thinking first determines the weekday corresponding to each date in the given range, then interprets the chart’s structure, identifies the low-density intervals (12:00–14:00), cross-references them with the weekday and business schedule, and outputs a clear, structured recommendation for the best visiting times.
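The date-to-weekday mapping that anchors this reasoning chain is easy to verify independently. A minimal Python check (our illustration, not part of the model's output):

from datetime import date

# Map each date in the user's range (Nov 8-12, 2025) to its weekday:
# the same first step the model performs before reading the chart.
for day in range(8, 13):
    d = date(2025, 11, day)
    print(d.isoformat(), d.strftime("%A"))
# 2025-11-08 Saturday ... 2025-11-12 Wednesday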

STEM Reasoning
Case: Solving a Bridge Circuit to Compute Equivalent Resistance
In this example, the model is presented with a non-trivial bridge circuit and asked to calculate the equivalent resistance between nodes A and B.
This type of problem cannot be solved by direct series–parallel reduction and requires a full multi-step analysis using Ohm’s Law and Kirchhoff’s Current Law (KCL).
ERNIE-4.5-VL-28B-A3B-Thinking interprets the circuit diagram, identifies all node relationships, formulates current equations, and symbolically solves for the voltage and current ratios.
The model derives the correct analytical result, R = 7/5 Ω (1.4 Ω), while presenting a logically consistent reasoning chain.
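The original circuit figure is not reproduced here, but the node-equation method is easy to sketch with SymPy. The resistor values below are a hypothetical assignment chosen to reproduce the stated 7/5 Ω answer; the real values come from the figure:

import sympy as sp

# Hypothetical bridge (placeholder values, chosen so R_eq = 7/5 ohm):
# A-C = 1 ohm, A-D = 2 ohm, C-B = 2 ohm, D-B = 1 ohm, bridge C-D = 1 ohm.
V, Vc, Vd = sp.symbols('V Vc Vd')

# KCL at node C: current in from A = current out to B and to D.
kcl_c = sp.Eq((V - Vc) / 1, Vc / 2 + (Vc - Vd) / 1)
# KCL at node D: current in from A and from C = current out to B.
kcl_d = sp.Eq((V - Vd) / 2 + (Vc - Vd) / 1, Vd / 1)

sol = sp.solve([kcl_c, kcl_d], [Vc, Vd])

# Total current drawn from node A, then R_eq = V / I_total.
i_total = (V - sol[Vc]) / 1 + (V - sol[Vd]) / 2
print(sp.simplify(V / i_total))  # 7/5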

Visual Grounding
Case: Detecting People Wearing Suits and Outputting Structured Coordinates
In this case, the model is given an image containing multiple human figures and an instruction: “Identify all people wearing suits and output their bounding box coordinates in JSON format.”
ERNIE-4.5-VL-28B-A3B-Thinking correctly follows the instruction, detecting every relevant individual and returning a complete list of bounding boxes with precise numerical coordinates.
The output reflects both its visual grounding capability — linking language prompts with visual regions — and its instruction-following consistency in structured output generation.
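A request of this kind can be reproduced against an OpenAI-compatible endpoint such as the vLLM server from the Quickstart section below; the port and image URL here are placeholders:

from openai import OpenAI

# Assumes a local vLLM (or FastDeploy) server as shown in the sections below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/crowd.jpg"}},  # placeholder image
            {"type": "text",
             "text": "Identify all people wearing suits and output their "
                     "bounding box coordinates in JSON format."},
        ],
    }],
)

# Expected: a JSON list of bounding-box entries in the message content;
# parse it with json.loads() once any surrounding prose is stripped.
print(response.choices[0].message.content)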

Thinking with Images
Case: Identifying Text on a Blue Sign through Image Zooming
In this example, the model is asked: “What’s the text of the sign with a blue background on the wall next to the sidewalk?”
ERNIE-4.5-VL-28B-A3B-Thinking analyzes the image, locates the region of interest, and autonomously calls the image zoom-in tool to examine the sign’s details more clearly.
After zooming in, the model accurately identifies the white text on the blue sign as “HOTEL BUZA.”
This case demonstrates the model’s Thinking with Images capability, which enables detailed visual reasoning by dynamically focusing on fine-grained areas.
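Mechanically, a zoom-in tool only needs to crop the requested region and hand it back as a new image for the next reasoning turn. A minimal Pillow sketch (the function name and box format are illustrative, not the model's actual tool schema):

import base64
import io

from PIL import Image

def image_zoom_in(image_path: str, box: list) -> str:
    """Crop the region [x1, y1, x2, y2] and return it as a base64-encoded PNG.

    Illustrative only: the real tool's name, arguments, and return format
    are defined by the serving stack, not by this sketch.
    """
    region = Image.open(image_path).crop(tuple(box))
    buf = io.BytesIO()
    region.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

# e.g. zoom into the sign region proposed by the model (placeholder coordinates)
cropped_b64 = image_zoom_in("street.jpg", [420, 310, 560, 380])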

Tool Utilization
Case: Identifying a Plush Toy through External Image Search
In this example, the model is shown an image of a round yellow cartoon chicken and asked: “What is this?”
Recognizing that internal knowledge may not be sufficient, ERNIE-4.5-VL-28B-A3B-Thinking autonomously decides to call an image search tool to retrieve visually similar images and related product information from the web.
It gathers multiple candidate results, compares visual attributes and contextual cues, and determines that the object is “Dundun,” a plush toy character associated with the MINISO brand.
This case illustrates the model’s tool utilization capability — performing multi-step reasoning, invoking external tools when necessary, and integrating retrieved evidence into a coherent final answer.
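With a server started using --enable-auto-tool-choice (see the vLLM section below), this flow maps onto standard tool calling. The image_search tool below is a hypothetical declaration; executing the search and returning results is up to the application:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema; the application is responsible for executing it.
tools = [{
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for visually similar images and return "
                       "titles and source pages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text."},
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/toy.jpg"}},  # placeholder image
            {"type": "text", "text": "What is this?"},
        ],
    }],
    tools=tools,
)

# When the model judges its own knowledge insufficient, it emits a tool call
# instead of a direct answer; run the search and send the results back.
print(response.choices[0].message.tool_calls)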

Video Understanding
Case: Extracting Subtitles and Locating Specific Scenes within a Video
In this case, the model is presented with a video and performs two related video understanding tasks.
First, it extracts all on-screen subtitles together with their timestamps, generating a structured output that maps each sentence to its moment of appearance.
Second, when asked “Which parts of the video were filmed on a bridge?”, the model analyzes visual cues such as structures, lighting, and perspective, identifying the relevant time intervals at approximately 17s, 37s, and 47s.
This example illustrates ERNIE-4.5-VL-28B-A3B-Thinking’s integrated ability in video text extraction, temporal reasoning, and spatiotemporal scene understanding, enabling accurate and interpretable analysis of dynamic visual content.
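Video inputs reuse the message format from the transformers Quickstart below; assuming the processor accepts a video_url content entry analogous to the image_url entry shown there, the request looks like:

messages = [{
    "role": "user",
    "content": [
        {"type": "video_url",
         "video_url": {"url": "https://example.com/clip.mp4"}},  # placeholder URL
        {"type": "text",
         "text": "Extract all on-screen subtitles with their timestamps, "
                 "then list the time intervals filmed on a bridge."},
    ],
}]
# Feed `messages` through the same apply_chat_template / process_vision_info
# pipeline as in the Quickstart; the extracted frames arrive via `videos=`.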


Quickstart
Using transformers Library
Here is an example of how to use the transformers library for inference:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load the model with bfloat16 weights, sharded automatically across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Render the chat template to a prompt string and extract the vision inputs.
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)

# Decode only the newly generated tokens, skipping the prompt.
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
vLLM Inference
Install the vLLM main branch
pip install uv
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
Run vLLM
# 1x 80GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
Run vLLM using reasoning-parser and tool-call-parser
# 1x 80GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--tool-call-parser ernie45 \
--enable-auto-tool-choice
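With both parsers enabled, the server separates the model's thinking from its final answer. A minimal client sketch (vLLM's OpenAI-compatible API returns the parsed thinking as reasoning_content):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"}},
            {"type": "text",
             "text": "What color clothes is the girl in the picture wearing?"},
        ],
    }],
)

message = response.choices[0].message
print("thinking:", message.reasoning_content)  # populated by --reasoning-parser ernie45
print("answer:", message.content)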
FastDeploy Inference
Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.
Note: For single-card deployment, at least 80GB of GPU memory is required.
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--max-model-len 131072 \
--max-num-seqs 32 \
--port 8180 \
--quantization wint8 \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }'
Finetuning with ERNIEKit
ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.
Usage Examples:
# Download model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking
# SFT
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml
# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml
For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.
License
The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
Citation
If you find ERNIE 4.5 useful or wish to use it in your projects, please cite our technical report:
@misc{ernie2025technicalreport,
      title={ERNIE 4.5 Technical Report},
      author={Baidu-ERNIE-Team},
      year={2025},
      primaryClass={cs.CL},
      howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}