[{"content":"Today, we officially release ERNIE 5.1. While inheriting the pre-training foundation of ERNIE 5.0, it compresses total parameters to approximately one-third and active parameters to approximately one-half, achieving leading foundational performance at its model scale using only about 6% of the pre-training cost of comparable models.\nTo advance the evolution of large models toward autonomous decision-making agents, we built an entirely new disaggregated fully-asynchronous reinforcement learning infrastructure, specifically addressing the global optimization challenges posed by training-inference divergence, low resource utilization, and long-tail effects.\nOn this foundation, through scaled agentic post-training combined with an end-to-end synergy strategy across environment, expert, and integration stages, we achieved a dual leap in both training efficiency and model capability, ensuring that the model maintains exceptional stability and outstanding performance even when handling complex long-tail tasks.\nAs one of the current cost-performance benchmarks among Chinese-developed large models, ERNIE 5.1 achieves a leap forward in parameter efficiency and training cost optimization while maintaining flagship-level intelligence. Its performance has been validated on internationally authoritative leaderboards: on May 9, ERNIE 5.1 scored 1,223 to claim 4th place globally and 1st among Chinese models on the Arena Search leaderboard.\nVisit the official website at https://ernie.baidu.com to chat with the latest ERNIE 5.1 model and explore a new era of intelligence. 
Baidu AI Studio has also launched an ERNIE 5.1 Playground for hands-on experience: 飞桨AI Studio星河社区-人工智能学习与实训社区.\nERNIE 5.1: Outstanding Agent and Reasoning Capabilities, with World Knowledge Ranking Among Top-Tier Models ERNIE 5.1 delivers strong results across multiple authoritative industry benchmarks, particularly in agentic capabilities, knowledge, reasoning, and deep search:\nOutstanding agentic capabilities on par with the world\u0026rsquo;s top models: On the τ³-bench and SpreadsheetBench-Verified agent evaluation tasks, ERNIE 5.1 surpasses DeepSeek-V4-Pro, with agentic capabilities approaching those of leading closed-source models. It also performs exceptionally well on the Search Arena leaderboard. Leading world knowledge and creative writing capabilities: On GPQA and MMLU-Pro evaluations, ERNIE 5.1 approaches the performance of leading closed-source models. In internal evaluations, ERNIE 5.1\u0026rsquo;s creative writing capabilities approach those of Gemini 3.1 Pro. Reasoning capabilities approaching leading closed-source models: On AIME26 (with tool use), a challenging mathematical competition benchmark, ERNIE 5.1 scores 99.6 — second only to Gemini 3.1 Pro. Technical Features Multi-Dimensional Elastic Pre-Training: Pre-training Compute Cost at Only 6% of Comparable Models ERNIE 5.1 is derived from ERNIE 5.0, extracting the optimal sub-network architecture from ERNIE 5.0\u0026rsquo;s multi-dimensional elastic sub-model matrix to effectively inherit the knowledge and capabilities encoded in ERNIE 5.0 while significantly reducing pre-training cost. The R\u0026amp;D team proposed an innovative Once-For-All elastic training framework. 
While traditional approaches require separate pre-training runs for models at different scales, ERNIE 5.0 jointly optimizes a large number of sub-models with varying depths, expert capacities, and routing sparsity levels through a dynamic sampling mechanism within a single pre-training run, constructing a sub-model matrix that spans diverse parameter scales and computational budgets. Throughout this process, the model achieves elastic compression and expansion along three dimensions:\nElastic depth: During training, the number of active Transformer layers is randomly varied, enabling sub-models at different depths to share weights and adaptively learn a balance between deep and shallow representations. Elastic width / expert capacity: The effective expert capacity in MoE layers is elastically controlled by varying the number of experts participating in routing. By dynamically sampling subsets of experts, the model learns to operate under both full and reduced expert-pool configurations, thereby improving expert utilization efficiency. Elastic sparsity: Through a variable Top-k routing mechanism, the number of activated experts is flexibly adjusted. Activating fewer experts reduces inference cost and improves decoding efficiency, while activating more enhances model capability, achieving a dynamic trade-off between inference overhead and performance. Building on this breakthrough, ERNIE 5.1 compresses total parameters to approximately one-third and activated parameters to approximately half those of ERNIE 5.0, with pre-training compute cost at only 6% of comparable models at the same scale. 
Compared to ERNIE 5.0, inference cost is significantly reduced while still achieving leading performance among models of comparable scale.\nDecoupled Fully-Asynchronous Reinforcement Learning Training: Greater Efficiency, Stability, and Cost Reduction We built a disaggregated reinforcement learning infrastructure on PaddlePaddle to support the multi-stage reinforcement learning training of ERNIE 5.1. To achieve more efficient, stable, and cost-effective training for long-horizon reinforcement learning tasks, we focused our optimizations in three key areas:\nDisaggregated fully-asynchronous architecture: We designed and developed a disaggregated architecture centered on an RL Controller, fully decoupling the control plane across four major subsystems — training, inference, reward, and agent loop. The subsystems are bridged and interact through high-performance network-based data components, achieving separation of the control plane from the data plane. Under this architecture, each subsystem can be independently deployed and independently scaled, matched to its optimal compute configuration. Meanwhile, inference, training, and reward naturally form a pipeline that can be fully overlapped, establishing a highly scalable foundation for long-horizon asynchronous agentic RL training. FP8 training-inference consistency optimization: Based on PaddlePaddle\u0026rsquo;s unified training-inference framework, we implemented a unified FP8 low-precision operator library to minimize precision divergence between training and inference in reinforcement learning. 
To address routing divergence between training and inference in MoE models, we performed in-depth optimization of the Rollout Router Replay (R3) technique — through two-stage computation-communication overlap, combined with dynamic bit-width communication compression and multi-level KV-Cache pooling, enabling R3 with near-zero additional training-inference latency overhead while reducing K3 KL divergence by 50%, providing a critical guarantee for stable long-horizon training of ERNIE 5.1. Heterogeneous elastic resource scheduling: Thanks to the disaggregated architecture, we can flexibly assign optimal compute configurations on demand to each training, inference, and reward subsystem, fully leveraging the cluster\u0026rsquo;s elastic compute capacity to reduce end-to-end rollout latency. To address the widespread underutilization of CPU resources in AI clusters, we implemented an elastic CPU pooling strategy. This elastic mechanism fully utilizes idle CPU compute across the cluster to support logic-intensive computations such as code sandboxes and verifiers, improving resource utilization while reducing training iteration time. A Multi-Stage Reinforcement Learning Training Pipeline Centered on OPD, Ensuring Comprehensive Capability Integration The post-training of conventional large language models (LLMs) typically follows a sequential pipeline, progressing from supervised fine-tuning (SFT) to multi-stage mixed reinforcement learning (Mixed RL). However, as model capabilities continue to scale, this sequential training paradigm has increasingly become a bottleneck, severely hindering the efficiency of research, development, and iteration. 
Moreover, attempting to fuse all capabilities within a single training stage introduces severe multi-objective optimization conflicts, making it extremely difficult to balance performance across different domain tasks and achieve Pareto optimality — improvements in one capability often come at the cost of regressions in another (i.e., the \u0026ldquo;seesaw\u0026rdquo; effect).\nTo overcome these fundamental challenges, we propose a multi-stage reinforcement learning training pipeline centered on Multi-Teacher On-Policy Distillation (MOPD). This pipeline significantly accelerates the R\u0026amp;D cycle through parallelized expert model training while ensuring comprehensive and conflict-free capability integration. Specifically, the post-training pipeline of ERNIE 5.1 is a four-stage process that decouples expert training from unified capability fusion:\nStage 1: Unified Supervised Fine-Tuning (SFT). High-quality multi-domain instruction data is leveraged for fine-tuning, establishing the model\u0026rsquo;s foundational capabilities in instruction following and tool invocation, which serve as the initialization checkpoint for subsequent capability expansion. Stage 2: Domain Expert Model Training. Multiple domain-specific expert models (e.g., code, reasoning, agentic tasks) are trained in parallel. Each direction independently customizes its dedicated reward signals and training algorithms, fundamentally eliminating mutual interference across heterogeneous tasks. Stage 3: On-Policy Distillation (OPD). With the unified SFT model as the student and multiple domain expert models as teachers, the student samples from its own policy distribution and concurrently learns from multiple teachers\u0026rsquo; capabilities via token-level reverse KL divergence, efficiently consolidating the capabilities of diverse experts into a unified parameter space. Stage 4: General Online Reinforcement Learning (General-RL). 
Following the initial OPD stage, we deliberately introduce an online RL phase tailored for general-purpose conversational scenarios. Our experiments reveal that not all tasks are amenable to capability fusion via token-level KL-based OPD. Specifically, tasks characterized by high-entropy distributions — such as open-ended chat or creative writing — tend to suffer from low distillation efficiency and may cause excessive smoothing of the output probability distribution. To address this, we forgo distillation for this domain and instead apply online RL on top of the post-OPD model. This stage ensures the model\u0026rsquo;s instruction-following capability, generation diversity, and improved alignment with human preferences, substantially enhancing general-purpose competence while preserving the expert capabilities acquired in earlier stages. Outstanding Creative Capabilities Through iterative optimization of the technical architecture and targeted refinement of core technologies, ERNIE 5.1 delivers a comprehensive upgrade in foundational capabilities while also excelling in creative performance.\nWhether it is the precise alignment of \u0026ldquo;inspiration–emotion–expression\u0026rdquo; in creative writing, the coordinated control of logic–character–pacing in long-form narrative, or the dual balance of knowledge accuracy–stylistic adaptability in professional content, ERNIE 5.1 consistently penetrates beyond users\u0026rsquo; surface-level requests to capture their core intent, producing work that is warm, deep, and logical — exceeding expectations. 
This closed-loop capability from intent insight to content creation achieves not only precise synergy between comprehension and generation at the technical level, but has also earned widespread recognition from creative enterprises, content platforms, and professional writers — regarded as a benchmark creative model that understands users, understands content, and understands context.\nWe are grateful for the evaluation feedback from leading content interaction enterprises, platforms, and writers/creators. In addition, starting today ERNIE 5.1 will be progressively rolled out on over ten creative production agent platforms, including ISEKAI ZERO (a leading global AI roleplay interactive platform), Mulan AI (a creative agent platform), Diting Huanliu (an AI-native creative canvas), and Storymaster (an AI short drama generation platform). Creators and users are welcome to try them out.\nThe continuous iteration and advancement of the ERNIE family of models would not be possible without the strong support of our technical foundation and a shared commitment to long-term value alongside our users.\nWe appreciate every developer and partner who has tested and used the model in our community — each of your suggestions drives model optimization forward. We appreciate the enterprises that have chosen to partner with us — your real-world use cases are what allow technology to truly take root. Above all, we appreciate every user who has been patient with the model\u0026rsquo;s imperfections and continued to place their trust in us — it is your trust that gives us the courage to push beyond boundaries.\nThe evolution of AI has no finish line, and every advance of the ERNIE family of models is driven by real-world needs. 
Going forward, we will continue to stay open, listen to every voice, and ensure that technology serves our users in the most grounded way possible.\n","permalink":"/blog/posts/ernie-5.1-0508-release/","summary":"ERNIE 5.1 is officially released, achieving leading performance at only 6% of the pre-training cost of comparable models. Powered by disaggregated fully-asynchronous reinforcement learning and scaled agentic post-training, ERNIE 5.1 delivers comprehensive upgrades across Agent, reasoning, and creative capabilities, ranking 1st among Chinese models on the Arena Search leaderboard.","title":"ERNIE 5.1 Officially Released! Topping Multiple Leaderboards — A Model That Writes Better and Understands You More"},{"content":"On April 30, LMArena released its latest rankings. ERNIE-5.1-Preview ranked No. 1 among Chinese models and No. 13 globally on the LMArena Text leaderboard, demonstrating strong general text capabilities.\nERNIE-5.1-Preview also delivered outstanding performance across multiple category leaderboards:\nMath: #9 globally Legal \u0026amp; Government: #1 globally Business, Management \u0026amp; Financial Ops: #4 globally Software \u0026amp; IT Services: #7 globally ERNIE-5.1-Preview builds on the pre-training foundation of ERNIE-5.0 while compressing total parameters to approximately 1/3 and active parameters to approximately 1/2, achieving leading performance at its model scale using only about 6% of the pre-training cost of comparable models. 
Powered by decoupled fully-asynchronous reinforcement learning and scaled agentic post-training, ERNIE-5.1-Preview delivers comprehensive improvements in foundational capabilities, cost-effectiveness, and creative performance.\nWe welcome you to directly experience the ERNIE series models through https://ernie.baidu.com/.\nMoving forward, we will continue to deepen our technical expertise and promote open collaboration, partnering with developers worldwide to drive innovation in the intelligent era.\n","permalink":"/blog/posts/ernie-5.1-preview-0430-release-on-lmarena/","summary":"On April 30, LMArena released its latest rankings. ERNIE-5.1-Preview ranked No. 1 among Chinese models and No. 13 globally on the LMArena Text Arena, placing in the global top 10 across multiple category leaderboards.","title":"ERNIE-5.1-Preview Tops LMArena Text Leaderboard as No.1 Chinese Model!"},{"content":" ERNIE-Image is a state-of-the-art text-to-image generation model developed by Baidu, built on a single-stream Diffusion Transformer (DiT) architecture with 8 billion DiT parameters. It achieves leading performance among open-weight models across multiple benchmarks, demonstrating exceptional capabilities in generating high-quality, diverse, and semantically accurate images from text descriptions.\nThe model leverages a unified single-stream transformer design that processes both text and image tokens within a shared attention mechanism. This architectural choice enables more effective cross-modal interaction compared to traditional dual-stream approaches, where text and image features are processed separately before being combined. By treating all tokens equally in the attention computation, ERNIE-Image achieves better alignment between textual descriptions and visual outputs.\nERNIE-Image employs a flow matching training objective, which provides more stable training dynamics and improved sample quality compared to conventional diffusion model training approaches. 
The flow matching framework defines a continuous-time generative process that smoothly transforms noise into structured image data, guided by the text conditioning signal. This training methodology, combined with the large-scale transformer backbone, enables the model to capture complex visual concepts and relationships described in natural language prompts.\nThe model demonstrates strong performance across a wide range of image generation tasks, including photorealistic scene generation, artistic style rendering, text rendering within images, character and portrait generation, landscape and nature scenes, abstract and concept art, fantasy and imaginative compositions, and fine-grained object rendering. Each of these capabilities has been extensively evaluated against competing open-weight models, with ERNIE-Image consistently achieving top-tier results.\nOne of the key innovations in ERNIE-Image is its approach to positional encoding for high-resolution image generation. The model utilizes an advanced positional encoding scheme that enables flexible resolution and aspect ratio support, allowing users to generate images at various dimensions without quality degradation. This is particularly important for practical applications where different image formats and sizes are required.\nThe training pipeline for ERNIE-Image involves multiple stages, beginning with large-scale pretraining on diverse image-text pairs, followed by supervised fine-tuning on high-quality curated datasets, and finally alignment optimization to improve aesthetic quality and prompt adherence. This multi-stage approach ensures that the model develops both broad visual knowledge and refined generation capabilities.\nERNIE-Image also incorporates advanced text encoding capabilities, utilizing a powerful language model to extract rich semantic representations from input prompts. 
This enables the model to understand complex compositional descriptions, spatial relationships, attribute bindings, and abstract concepts. The text encoder is designed to capture both local details and global semantic structure, providing comprehensive conditioning information for the image generation process.\nIn benchmark evaluations, ERNIE-Image has been tested on standard text-to-image generation benchmarks including GenEval, DPG-Bench, and human preference studies. The model achieves competitive or superior performance compared to other leading open-weight models, particularly excelling in prompt following accuracy, visual quality, and diversity of generated outputs.\nThe model supports both English and Chinese text prompts, making it accessible to a broader user base. The bilingual capability is achieved through careful multilingual training and a shared semantic space that maps both languages to a common representation before conditioning the image generation process.\nERNIE-Image represents a significant advancement in open-weight text-to-image generation, combining a powerful transformer architecture with effective training methodologies to deliver state-of-the-art image quality. The model is designed for both research and practical applications, offering a strong foundation for further development and customization in the rapidly evolving field of visual content generation.\nFor developers and researchers, ERNIE-Image provides comprehensive API access and model weights, enabling integration into various applications ranging from creative tools and design assistants to content generation pipelines and research experiments. 
The model\u0026rsquo;s architecture is designed for efficient inference, with optimizations that enable practical deployment at scale while maintaining high output quality across diverse prompt types and generation scenarios.\n","permalink":"/blog/posts/ernie-image/","summary":"ERNIE-Image is a text-to-image generation model built on a single-stream Diffusion Transformer (DiT) with 8B DiT parameters, achieving leading performance among open-weights models.","title":"Introducing ERNIE-Image"},{"content":"1. Introduction Most existing multimodal models excel at understanding but often remain text-centric in their outputs. Attempts to enable multimodal generation usually rely on late-fusion architectures, where specialized decoders are stitched onto a pre-trained language backbone. While functional, this patchwork approach decouples understanding from generation and limits the depth of cross-modal reasoning.\nERNIE 5.0 introduces a paradigm shift. It is a Unified Multimodal Model trained from scratch to integrate text, images, video, and audio within a single autoregressive framework.\nKey highlights:\n2.4 Trillion Parameters: A massive-scale foundational model built on a unified autoregressive backbone. Unified Objective: We map all modalities to a shared token space and optimize them end-to-end using a unified Next-Group-of-Tokens Prediction. Omni-Capability: By effectively dissolving modality barriers, the model achieves seamless multimodal understanding and generation. 2. Architecture: Genuine Unification ERNIE 5.0 adopts a fully unified approach:\nText Modeling: Utilizes standard Next-Token Prediction (NTP), accelerated by Multi-Token Prediction (MTP) for enhanced inference throughput. Vision Modeling: Adopts Next-Frame-and-Scale Prediction (NFSP). Images are treated as single-frame videos, enabling the model to learn spatial (multi-scale) and temporal (multi-frame) representations simultaneously. 
Audio Modeling: Implements Next-Codec Prediction (NCP) with a depth-wise autoregressive design, hierarchically modeling audio from semantic content to fine-grained acoustic details. This unified formulation allows the model to learn intrinsic semantic alignments among modalities rather than superficial translations.\n3. Scalability and Efficiency Training a 2.4T-parameter multimodal model presents significant computational challenges, which we address through two core technological innovations:\n3.1 Ultra-Sparse MoE We employ a Mixture-of-Experts (MoE) architecture featuring Modality-Agnostic Routing.\nShared Expert Pool: Experts are not segregated by modality (e.g., \u0026ldquo;vision experts\u0026rdquo; vs \u0026ldquo;text experts\u0026rdquo;); instead, dynamic routing is driven solely by token features. \u0026lt;3% Activation Rate: Despite the trillion-parameter scale, fewer than 3% of parameters are activated per token. This design yields massive capacity while keeping computational costs comparable to much smaller dense models. 3.2 Elastic Training (Once-For-All) To address diverse deployment constraints, we introduce Elastic Training, which optimizes a super-network capable of spawning multiple sub-configurations:\nElastic Depth: Stochastic layer skipping during training. Elastic Width: Dynamic restriction of the active expert pool. Elastic Sparsity: Adaptive Top-k routing for adjustable inference cost. This \u0026ldquo;Once-For-All\u0026rdquo; approach allows for the instant deployment of efficient sub-models without the need for resource-intensive retraining.\n4. Training Methodology 4.1 Data Foundation Our pre-training corpus consists of trillions of tokens, featuring UTF-16BE encoding for superior multilingual support. 
We utilize a mix of paired data (image-text, video-text) and interleaved sequences to enforce robust cross-modal contextual learning.\n4.2 Training \u0026amp; Infrastructure Built upon PaddlePaddle, ERNIE 5.0 adopts a customized hybrid parallel strategy to manage the ultra-sparse MoE architecture at scale. The training process is rigorously staged—extending context length from 8K to 128K—and incorporates advanced stability techniques to prevent any single modality from dominating gradient updates.\n4.3 Post-Training To align ERNIE 5.0 for complex tasks, we developed a specialized Reinforcement Learning (RL) pipeline:\nU-RB (Unbiased Replay Buffer): Addressing long-tail response inefficiency without introducing sampling bias. Stability Mechanisms (MISC \u0026amp; WPSM): Techniques to mitigate entropy collapse and focus optimization on challenging samples. AHRL (Adaptive Hint-based RL): A scaffolding method that provides fading \u0026ldquo;thinking skeletons\u0026rdquo; (hints) to facilitate learning on sparse-reward, hard-reasoning tasks. 5. Evaluation \u0026amp; Results ERNIE 5.0 establishes new state-of-the-art benchmarks across modalities:\n5.1 Language Capabilities ERNIE 5.0 shows strong performance across both pre-training and post-training evaluations, spanning knowledge, reasoning, coding, instruction following, and agentic tool-use tasks.\n(Table 1: Pre-training comparisons)\n(Table 2: Post-training comparisons)\n5.2 Multimodal Understanding Demonstrates strong multimodal understanding capabilities across diverse benchmarks.\n(Table 3: Multimodal Understanding)\n5.3 Generation Capabilities Shows superior performance in visual generation tasks for both high-fidelity images and video.\n(Table 5: Image Generation)\n(Table 6: Video Generation)\n5.4 Audio Capabilities Achieves best-in-class results in Audio Understanding (e.g., TUT2017) and competitive Text-to-Speech performance.\n(Table 7: Audio Understanding)\n(Table 8: Text-to-Speech)\n6. 
Conclusion ERNIE 5.0 represents a decisive step away from the fragmented \u0026ldquo;patchwork\u0026rdquo; era of AI and toward a future of truly native multimodal intelligence. By successfully unifying understanding and generation within a single, elastic, and scalable autoregressive framework, we have laid the groundwork for systems that do not just process data, but perceive and create with the fluidity of human cognition.\nLooking ahead, the innovations in Modality-Agnostic Routing and Elastic Training unlock new possibilities for deploying massive-scale intelligence across diverse environments—from cloud superclusters to edge devices—without compromising on capability. As we continue to refine this unified paradigm, ERNIE 5.0 serves as a robust foundation for the next leap in General Artificial Intelligence (AGI), where the boundaries between listening, speaking, reading, writing, and reasoning effectively dissolve.\nCitation @misc{wang2026ernie50technicalreport, title={ERNIE 5.0 Technical Report}, author={Haifeng Wang and others}, year={2026}, eprint={2602.04705}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2602.04705} } ","permalink":"/blog/posts/ernie5.0/","summary":"We introduce ERNIE 5.0: a 2.4 trillion-parameter Unified Multimodal Model trained from scratch. Integrating text, image, video, and audio into a single autoregressive framework, it overcomes the limitations of late-fusion architectures to achieve seamless cross-modal understanding and generation.","title":"ERNIE 5.0: A 2.4 Trillion-Parameter Unified Multimodal Foundation Model"},{"content":"Demo | GitHub | Hugging Face | Technical Report |🔥Official Website\nIntroduction PaddleOCR-VL-1.5 is an upgraded model achieving a new SOTA accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions—including scanning artifacts, skew, warping, screen photography, and illumination—we propose the Real5-OmniDocBench benchmark. 
Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model’s capabilities by incorporating seal recognition and text spotting tasks, while the model remains a highly efficient, ultra-compact 0.9B VLM.\nKey Capabilities of PaddleOCR-VL-1.5 With 0.9B parameters, PaddleOCR-VL-1.5 achieves 94.5% accuracy on OmniDocBench v1.5, surpassing the previous SOTA model PaddleOCR-VL. Significant improvements are observed in table, formula, and text recognition. It introduces an innovative approach to document parsing by supporting irregular-shaped localization, enabling accurate polygonal detection under skewed and warped document conditions. Evaluations across five real-world scenarios (scanning, skew, warping, screen-photography, and illumination) demonstrate superior performance over mainstream open-source and proprietary models. The model introduces text spotting (text-line localization and recognition), along with seal recognition, with all corresponding metrics setting new SOTA results in their respective tasks. PaddleOCR-VL-1.5 further strengthens its capability in specialized scenarios and multilingual recognition. Recognition performance is improved for rare characters, ancient texts, multilingual tables, underlines, and checkboxes, and language coverage is extended to include China\u0026rsquo;s Tibetan script and Bengali. The model supports automatic cross-page table merging and cross-page paragraph heading recognition, effectively mitigating content fragmentation issues in long-document parsing. Model Architecture Usage Install Dependencies Install PaddlePaddle and PaddleOCR:\n# The following command installs the PaddlePaddle version for CUDA 12.6. 
For other CUDA versions and the CPU version, please refer to https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/ python -m pip install -U \u0026#34;paddleocr[doc-parser]\u0026#34; Please ensure that you install PaddlePaddle framework version 3.2.1 or above, along with the special version of safetensors. For macOS users, please use Docker to set up the environment.\nBasic Usage CLI usage:\npaddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png Python API usage:\nfrom paddleocr import PaddleOCRVL pipeline = PaddleOCRVL() output = pipeline.predict(\u0026#34;https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png\u0026#34;) for res in output: res.print() res.save_to_json(save_path=\u0026#34;output\u0026#34;) res.save_to_markdown(save_path=\u0026#34;output\u0026#34;) Accelerate VLM Inference via Optimized Inference Servers Start the VLM inference server:\nYou can start the vLLM inference service using one of two methods:\nMethod 1: PaddleOCR method\ndocker run \\ --rm \\ --gpus all \\ --network host \\ ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \\ paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8080 --backend vllm Method 2: vLLM method\nvLLM: PaddleOCR-VL Usage Guide\nCall the PaddleOCR CLI or Python API:\npaddleocr doc_parser \\ -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png \\ --vl_rec_backend vllm-server \\ --vl_rec_server_url http://127.0.0.1:8080/v1 from paddleocr import PaddleOCRVL pipeline = PaddleOCRVL(vl_rec_backend=\u0026#34;vllm-server\u0026#34;, vl_rec_server_url=\u0026#34;http://127.0.0.1:8080/v1\u0026#34;) output = 
pipeline.predict(\u0026#34;https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png\u0026#34;) for res in output: res.print() res.save_to_json(save_path=\u0026#34;output\u0026#34;) res.save_to_markdown(save_path=\u0026#34;output\u0026#34;) For more usage details and parameter explanations, see the documentation.\nPaddleOCR-VL-1.5-0.9B Usage with transformers Currently, the PaddleOCR-VL-1.5-0.9B model facilitates seamless inference via the transformers library, supporting comprehensive text spotting and the recognition of complex elements including formulas, tables, charts, and seals. Below is a simple script we provide to support inference using the PaddleOCR-VL-1.5-0.9B model with transformers.\nNotes: We currently recommend using the official method for inference, as it is faster and supports page-level document parsing. The example code below only supports element-level recognition and text spotting.\n# ensure the transformers v5 is installed python -m pip install \u0026#34;transformers\u0026gt;=5.0.0\u0026#34; from PIL import Image import torch from transformers import AutoProcessor, AutoModelForImageTextToText # ---- Settings ---- model_path = \u0026#34;PaddlePaddle/PaddleOCR-VL-1.5\u0026#34; image_path = \u0026#34;test.png\u0026#34; task = \u0026#34;ocr\u0026#34; # Options: \u0026#39;ocr\u0026#39; | \u0026#39;table\u0026#39; | \u0026#39;chart\u0026#39; | \u0026#39;formula\u0026#39; | \u0026#39;spotting\u0026#39; | \u0026#39;seal\u0026#39; # ------------------ # ---- Image Preprocessing For Spotting ---- image = Image.open(image_path).convert(\u0026#34;RGB\u0026#34;) orig_w, orig_h = image.size spotting_upscale_threshold = 1500 if task == \u0026#34;spotting\u0026#34; and orig_w \u0026lt; spotting_upscale_threshold and orig_h \u0026lt; spotting_upscale_threshold: process_w, process_h = orig_w * 2, orig_h * 2 try: resample_filter = Image.Resampling.LANCZOS except AttributeError: resample_filter = Image.LANCZOS image = 
image.resize((process_w, process_h), resample_filter) # Set max_pixels: use 1605632 for spotting, otherwise use default ~1M pixels max_pixels = 2048 * 28 * 28 if task == \u0026#34;spotting\u0026#34; else 1280 * 28 * 28 # --------------------------- # -------- Inference -------- DEVICE = \u0026#34;cuda\u0026#34; if torch.cuda.is_available() else \u0026#34;cpu\u0026#34; PROMPTS = { \u0026#34;ocr\u0026#34;: \u0026#34;OCR:\u0026#34;, \u0026#34;table\u0026#34;: \u0026#34;Table Recognition:\u0026#34;, \u0026#34;formula\u0026#34;: \u0026#34;Formula Recognition:\u0026#34;, \u0026#34;chart\u0026#34;: \u0026#34;Chart Recognition:\u0026#34;, \u0026#34;spotting\u0026#34;: \u0026#34;Spotting:\u0026#34;, \u0026#34;seal\u0026#34;: \u0026#34;Seal Recognition:\u0026#34;, } model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(DEVICE).eval() processor = AutoProcessor.from_pretrained(model_path) messages = [ { \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: [ {\u0026#34;type\u0026#34;: \u0026#34;image\u0026#34;, \u0026#34;image\u0026#34;: image}, {\u0026#34;type\u0026#34;: \u0026#34;text\u0026#34;, \u0026#34;text\u0026#34;: PROMPTS[task]}, ] } ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors=\u0026#34;pt\u0026#34;, images_kwargs={\u0026#34;size\u0026#34;: {\u0026#34;shortest_edge\u0026#34;: processor.image_processor.min_pixels, \u0026#34;longest_edge\u0026#34;: max_pixels}}, ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=512) result = processor.decode(outputs[0][inputs[\u0026#34;input_ids\u0026#34;].shape[-1]:-1]) print(result) # --------------------------- Use flash-attn to boost performance and reduce memory usage\n# ensure the flash-attn2 is installed pip install flash-attn --no-build-isolation model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch.bfloat16, 
attn_implementation=\u0026#34;flash_attention_2\u0026#34;).to(DEVICE).eval() Performance Document Parsing 1. OmniDocBench v1.5 PaddleOCR-VL-1.5 achieves SOTA performance for overall, text, formula, tables and reading order on OmniDocBench v1.5\nNotes:\nPerformance metrics are cited from the OmniDocBench official leaderboard, except for Gemini-3 Pro, Qwen3-VL-235B-A22B-Instruct and our model, which were evaluated independently.\n2. Real5-OmniDocBench Across all five diverse and challenging scenarios—scanning, warping, screen-photography, illumination, and skew—PaddleOCR-VL-1.5 consistently sets new SOTA records\nNotes:\nReal5-OmniDocBench is a brand-new benchmark oriented toward real-world scenarios, which we constructed based on the OmniDocBench v1.5 dataset. The dataset comprises five distinct scenarios: Scanning, Warping, Screen-photography, Illumination, and Skew. For further details, please refer to Real5-OmniDocBench.\nInference Performance Notes:\nEnd-to-End Inference Performance Comparison on OmniDocBench v1.5. PDF documents were processed in batches of 512 on a single NVIDIA A100 GPU. The reported end-to-end runtime includes both PDF rendering and Markdown generation. All methods rely on their built-in PDF parsing modules and default DPI settings to reflect out-of-the-box performance.\nVisualization Here are some Real-world Document Parsing examples.\nIllumination\nSkew\nScreen Photography Scanning\nWarping\nText Spotting\nSeal Recognition\nAcknowledgments We would like to thank PaddleFormers, Keye, MinerU, OmniDocBench for providing valuable code, model weights and benchmarks. 
We also appreciate everyone\u0026rsquo;s contribution to this open-source project!\nCitation If you find PaddleOCR-VL-1.5 helpful, feel free to give us a star and citation.\n@misc{cui2026paddleocrvl15multitask09bvlm, title={PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing}, author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Yi Liu and Dianhai Yu and Yanjun Ma}, year={2026}, eprint={2601.21957}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2601.21957}, } ","permalink":"/blog/posts/paddleocr-vl-1.5/","summary":"🚀 We release PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA)accuracy of 94.5% on OmniDocBench v1.5.","title":"PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing"},{"content":"On January 15, LMArena released its latest rankings. ERNIE-5.0-0110 achieved a score of 1,460, ranking No. 1 among Chinese models and No. 8 globally on the LMArena Text leaderboard. It outperformed several leading models, including GPT-5.1-High and Gemini-2.5-Pro, demonstrating strong text capabilities.\nERNIE-5.0-0110 also delivered an outstanding performance on Math, ranking second globally, behind only GPT-5.2-High, highlighting its strength in complex reasoning and mathematical problem-solving.\nERNIE-5.0 on LMArena is no longer labeled as “Preview”, indicating that the model has completed its preview phase. 
Achieving the top position among Chinese models upon entering the leaderboard as a formal version demonstrates the model\u0026rsquo;s continuous breakthroughs in general text capabilities.\nWe welcome you to directly experience the ERNIE series models through https://ernie.baidu.com/.\nMoving forward, we will continue to deepen our technical expertise and promote open collaboration, partnering with developers worldwide to drive innovation in the intelligent era.\n","permalink":"/blog/posts/ernie-5.0-0110-release-on-lmarena/","summary":"On January 15, LMArena released its latest rankings. ERNIE-5.0-0110 achieved a score of 1,460, ranking No. 1 among Chinese models and No. 8 globally on the LMArena Text Arena.","title":"ERNIE-5.0 Tops LMArena Text Leaderboard as No.1 Chinese Model!"},{"content":"Just now, LMArena, the globally recognized platform for evaluating large models, released its latest Vision Arena rankings. Baidu’s ERNIE-5.0-Preview-1220 achieved a score of 1,226 points, ranking No. 1 among Chinese models and No. 8 globally, and standing as the only Chinese large model in the global top 10 for visual understanding.\nWe welcome you to directly experience the ERNIE series models through https://ernie.baidu.com/.\nMoving forward, we will continue to deepen our technical expertise and promote open collaboration, partnering with developers worldwide to drive innovation in the intelligent era.\n","permalink":"/blog/posts/ernie-5.0-preview-1220-release-on-lmarena/","summary":"On January 8, LMArena released its latest rankings. ERNIE-5.0-Preview-1220 achieved a score of 1,226, ranking No. 1 in China and No. 8 globally on the LMArena Vision Arena.","title":"ERNIE-5.0-Preview-1220 Becomes the Sole Chinese Model in LMArena Vision Top 10!"},{"content":"Just now, LMArena, the globally recognized platform for evaluating large models, published its latest rankings. 
Baidu’s ERNIE-5.0-Preview-1203 scored 1,451 points and demonstrated leadership across multiple key dimensions, including Creative Writing and Hard Prompts.\nDevelopers can experience ERNIE-5.0-Preview-1203 on LMArena and explore a new era of language intelligence that merges creativity with reasoning capabilities, starting today.\nWe welcome you to directly experience the ERNIE series models through https://ernie.baidu.com/.\nMoving forward, we will continue to deepen our technical expertise and promote open collaboration, partnering with developers worldwide to drive innovation in the intelligent era.\n","permalink":"/blog/posts/ernie-5.0-preview-1203-release-on-lmarena/","summary":"Just now, LMArena released its latest rankings. Baidu’s ERNIE-5.0-Preview-1203 scored an impressive 1,451 points.","title":"Best Text model from China in LMArena is now ERNIE-5.0-Preview-1203!"},{"content":"Quick update on the Text Arena leaderboard! We’ve just refreshed our standings with the latest ERNIE-5.0-Preview-1103 on LMArena. 🚀 ERNIE-5.0-Preview-1103 holds a top-20 spot in the most competitive Arena.\nWith upgraded foundational abilities, ERNIE 5.0 achieves state-of-the-art performance across various benchmark evaluations. This time on the Text Arena, ERNIE-5.0-Preview-1103 received 1471 in Software \u0026amp; IT Services (on par with GPT 5.1-high) and 1464 in Coding (matching chat-gpt-4o).\nModel Introduction ERNIE 5.0 is the new-generation foundation model of the ERNIE series, built upon natively unified omni-modal modeling technology. From the ground up, it jointly models text, images, audio, and video, empowering comprehensive multimodal understanding and generation capabilities. 
With fully upgraded foundational abilities, ERNIE 5.0 achieves state-of-the-art performance across various benchmark evaluations—excelling particularly in multimodal understanding, instruction following, creative writing, factual reasoning, agentic planning, and tool use.\nTechnical Highlights Natively Omni-Modal Modeling: Unlike most multimodal models that rely on late fusion, ERNIE 5.0 integrates text, images, audio, and video data from the start of training. This enables seamless joint input and output across all modalities, achieving omni-modal understanding and generation. Unified Understanding and Generation: ERNIE 5.0 overcomes the long-standing challenge of unified multimodal understanding and generation through deep fusion of perceptual and semantic features across visual and auditory modalities. This enables seamless synergy between understanding and generation, marking a new leap in omni-modal intelligence. Unified Autoregressive Architecture: By discretizing training objectives across modalities and training under a single autoregressive framework, ERNIE 5.0 achieves deep feature fusion and collaborative optimization within one unified architecture. It significantly boosts omni-modal modeling capability and consistency. Massive-Scale Mixture of Experts (MoE) Architecture: Built on the PaddlePaddle deep learning framework, ERNIE 5.0 features over 2 trillion parameters, ranking among the largest publicly disclosed models globally. Its ultra-sparse expert design (with \u0026lt;3% active parameters) delivers powerful multimodal performance while dramatically reducing computation and inference costs. Agentic Capability Enhancement for Long-Horizon Tasks: By synthesizing long-horizon task trajectory data from large-scale real-world/simulated tool environments, we perform data augmentation during pre-training and post-training. The model is then trained end-to-end with multi-round reinforcement learning, leveraging Chain-of-Thought and Chain-of-Action. 
This approach significantly improves the model\u0026rsquo;s agentic and tool-use capabilities. ","permalink":"/blog/posts/ernie-5.0-preview-1103-release-on-lmarena/","summary":"We’ve just refreshed our standings with the latest ERNIE-5.0-Preview-1103 on LMArena. 🚀 ERNIE-5.0-Preview-1103 holds the top 20 in the most competitive Arena.","title":"ERNIE-5.0-Preview-1103 landed on the LMArena Text Leaderboard!"},{"content":"Excited to share another milestone! Our newly released ERNIE-5.0-Preview-1120 has entered the LMArena Vision Leaderboard for the very first time! It lands straight in the Top 15 with a score of 1206, ranked top 1 in domestic models, on par with Claude Sonnet 4 and GPT-5-high! 🚀\nERNIE-5.0 is natively multimodal. Following its impressive results on the text leaderboard, it has also demonstrated excellent performance alongside leading models in the vision arena.\n","permalink":"/blog/posts/ernie-5.0-preview-1120-release-on-lmarena/","summary":"ERNIE-5.0-Preview-1120 now ranks #1 in domestic on the LMArena Vision leaderboard","title":"ERNIE-5.0-Preview-1120, ready for testing in LMArena!"},{"content":"BAIDU AI Studio Demo | Hugging Face Demo | GitHub | Hugging Face | BAIDU AI Studio | ERNIE Bot\nModel Highlights Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. 🧠✨ Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model\u0026rsquo;s representation power while deepening the semantic alignment between visual and language modalities—unlocking unprecedented capabilities in nuanced visual-textual reasoning. 
📊\nThe model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency. ⚡ Responding to strong community demand, we\u0026rsquo;ve significantly strengthened the model\u0026rsquo;s grounding performance with improved instruction-following capabilities, making visual grounding functions more accessible than ever. 🎯 Additionally, our innovative \u0026ldquo;Thinking with Images\u0026rdquo; feature, when paired with tools like image zooming and image search, dramatically elevates the model\u0026rsquo;s ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️\nTogether, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what\u0026rsquo;s possible in visual-language understanding. 🤖🌟\nKey Capabilities As a lightweight model that activates only 3B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry\u0026rsquo;s top flagship models across various benchmarks. 🚀\nVisual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨ STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡 Visual Grounding 📍🎨: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost! 
⚙️💪 Thinking with Images 🤔🔍: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information. 🖼️✨ Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚 Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟 Showcase Visual Reasoning Case: Analyzing a Peak-Time Chart to Identify Optimal Visiting Hours In this scenario, the model receives an image showing a “Peak Time Reminder” chart that visualizes customer traffic intensity across different time slots during the week.\nThe user asks the model to determine the optimal visiting periods between November 8 and 12, 2025, avoiding high-traffic hours and business peak days.\nERNIE-4.5-VL-28B-A3B-Thinking first determines the weekday corresponding to each date in the given range, then interprets the chart’s structure, identifies the low-density intervals (12:00–14:00), cross-references them with the weekday and business schedule, and outputs a clear, structured recommendation for the best visiting times.\nSTEM Reasoning Case: Solving a Bridge Circuit to Compute Equivalent Resistance In this example, the model is presented with a non-trivial bridge circuit and asked to calculate the equivalent resistance between nodes A and B.\nThis type of problem cannot be solved by direct series–parallel reduction and requires a full multi-step analysis using Ohm’s Law and Kirchhoff’s Current Law (KCL).\nERNIE-4.5-VL-28B-A3B-Thinking interprets the circuit diagram, identifies all node relationships, formulates current equations, and symbolically solves for the voltage and current ratios.\nThe model derives the correct analytical 
result, R = 7/5 Ω (1.4 Ω), while presenting a logically consistent reasoning chain.\nVisual Grounding Case: Detecting People Wearing Suits and Outputting Structured Coordinates In this case, the model is given an image containing multiple human figures and an instruction: “Identify all people wearing suits and output their bounding box coordinates in JSON format.”\nERNIE-4.5-VL-28B-A3B-Thinking correctly follows the instruction, detecting every relevant individual and returning a complete list of bounding boxes with precise numerical coordinates.\nThe output reflects both its visual grounding capability — linking language prompts with visual regions — and its instruction-following consistency in structured output generation.\nFigure: Visualization of the model’s grounding output — bounding boxes correspond to the JSON coordinates generated for “people wearing suits.” Thinking with Images Case: Identifying Text on a Blue Sign through Image Zooming In this example, the model is asked: “What’s the text of the sign with a blue background on the wall next to the sidewalk?”\nERNIE-4.5-VL-28B-A3B-Thinking analyzes the image, locates the region of interest, and autonomously calls the image zoom-in tool to examine the sign’s details more clearly.\nAfter zooming in, the model accurately identifies the white text on the blue sign as “HOTEL BUZA.”\nThis case demonstrates the model’s Thinking with Images capability, which enables detailed visual reasoning by dynamically focusing on fine-grained areas.\nTool Utilization Case: Identifying a Plush Toy through External Image Search In this example, the model is shown an image of a round yellow cartoon chicken and asked: “What is this?”\nRecognizing that internal knowledge may not be sufficient, ERNIE-4.5-VL-28B-A3B-Thinking autonomously decides to call an image search tool to retrieve visually similar images and related product information from the web.\nIt gathers multiple candidate results, compares visual attributes and contextual 
cues, and determines that the object is “Dundun,” a plush toy character associated with the MINISO brand.\nThis case illustrates the model’s tool utilization capability — performing multi-step reasoning, invoking external tools when necessary, and integrating retrieved evidence into a coherent final answer.\nVideo Understanding Case: Extracting Subtitles and Locating Specific Scenes within a Video In this case, the model is presented with a video and performs two related video understanding tasks.\nFirst, it extracts all on-screen subtitles together with their timestamps, generating a structured output that maps each sentence to its moment of appearance.\nSecond, when asked “Which parts of the video were filmed on a bridge?”, the model analyzes visual cues such as structures, lighting, and perspective, identifying the relevant time intervals at approximately 17s, 37s, and 47s.\nThis example illustrates ERNIE-4.5-VL-28B-A3B-Thinking’s integrated ability in video text extraction, temporal reasoning, and spatiotemporal scene understanding, enabling accurate and interpretable analysis of dynamic visual content.\nYour browser does not support the video tag. 
Quickstart Using transformers Library Here is an example of how to use the transformers library for inference:\nimport torch from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM model_path = \u0026#39;baidu/ERNIE-4.5-VL-28B-A3B-Thinking\u0026#39; model = AutoModelForCausalLM.from_pretrained( model_path, device_map=\u0026#34;auto\u0026#34;, dtype=torch.bfloat16, trust_remote_code=True ) processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) model.add_image_preprocess(processor) messages = [ { \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: [ { \u0026#34;type\u0026#34;: \u0026#34;text\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;What color clothes is the girl in the picture wearing?\u0026#34; }, { \u0026#34;type\u0026#34;: \u0026#34;image_url\u0026#34;, \u0026#34;image_url\u0026#34;: { \u0026#34;url\u0026#34;: \u0026#34;https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg\u0026#34; } }, ] }, ] text = processor.tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, ) image_inputs, video_inputs = processor.process_vision_info(messages) inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors=\u0026#34;pt\u0026#34;, ) device = next(model.parameters()).device inputs = inputs.to(device) generated_ids = model.generate( inputs=inputs[\u0026#39;input_ids\u0026#39;].to(device), **inputs, max_new_tokens=1024, use_cache=False ) output_text = processor.decode(generated_ids[0][len(inputs[\u0026#39;input_ids\u0026#39;][0]):]) print(output_text) vLLM Inference Install the vLLM main branch\npip install uv uv pip install -U vllm --pre \\ --extra-index-url https://wheels.vllm.ai/nightly \\ --extra-index-url https://download.pytorch.org/whl/cu129 \\ --index-strategy unsafe-best-match Run vLLM\n# 80G*1 GPU，If an error occurs, add the --gpu-memory-utilization 0.95 and try again vllm serve 
baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code Run vLLM using reasoning-parser and tool-call-parser\n# 80G*1 GPU，If an error occurs, add the --gpu-memory-utilization 0.95 and try again vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \\ --reasoning-parser ernie45 \\ --tool-call-parser ernie45 \\ --enable-auto-tool-choice FastDeploy Inference Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.\nNote: For single-card deployment, at least 80GB of GPU memory is required.\nfastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \\ --max-model-len 131072 \\ --max-num-seqs 32 \\ --port 8180 \\ --quantization wint8 \\ --reasoning-parser ernie-45-vl-thinking \\ --tool-call-parser ernie-45-vl-thinking \\ --mm-processor-kwargs \u0026#39;{\u0026#34;image_max_pixels\u0026#34;: 12845056 }\u0026#39; Finetuning with ERNIEKit ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.\nUsage Examples:\n# Download model huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking # SFT erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml # SFT (Function Call) erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.\nLicense The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. 
All Rights Reserved.\nCitation If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:\n@misc{ernie2025technicalreport, title={ERNIE 4.5 Technical Report}, author={Baidu-ERNIE-Team}, year={2025}, primaryClass={cs.CL}, howpublished={\\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}} } ","permalink":"/blog/posts/ernie-4.5-vl-28b-a3b-thinking/","summary":"We release ERNIE-4.5-VL-28B-A3B-Thinking, a multimodal reasoning model that achieves SOTA performance while activating only 3B parameters.","title":"ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI"},{"content":"We’re thrilled to share that the ERNIE-5.0-Preview-1022 now ranks #2(tied) globally on the LMArena Text leaderboard — one of the world’s most recognized benchmarks for large language models driven by real-world use.\nAs part of our commitment, we plan to officially release ERNIE-5.0-Preview-1022 in the near future. This will allow more users and developers to experience and evaluate the model firsthand.\nStay tuned — we’ll be sharing it on ernie.baidu.com soon!\n","permalink":"/blog/posts/ernie-5.0-preview-1022-release-on-lmarena/","summary":"ERNIE-5.0-Preview-1022 now ranks #2 globally on the LMArena Text leaderboard","title":"ERNIE-5.0-Preview-1022, ready for testing in LMArena!"},{"content":"Demo | GitHub | Hugging Face | Technical Report\nIntroducing PaddleOCR-VL We are excited to release PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. 
Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.\nCore Features Compact yet Powerful VLM Architecture: We present a novel vision-language model that is specifically designed for resource-efficient inference, achieving outstanding performance in element recognition. By integrating a NaViT-style dynamic high-resolution visual encoder with the lightweight ERNIE-4.5-0.3B language model, we significantly enhance the model’s recognition capabilities and decoding efficiency. This integration maintains high accuracy while reducing computational demands, making it well-suited for efficient and practical document processing applications.\nSOTA Performance on Document Parsing: PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions and exhibits strong competitiveness against leading vision-language models (VLMs) in document parsing. Moreover, it excels in recognizing complex document elements, such as text, tables, formulas, and charts, making it suitable for a wide range of challenging content types, including handwritten text and historical documents. This makes it highly versatile and suitable for a wide range of document types and scenarios.\nMultilingual Support: PaddleOCR-VL supports 109 languages, covering major global languages, including but not limited to Chinese, English, Japanese, Latin, and Korean, as well as languages with different scripts and structures, such as Russian (Cyrillic script), Arabic, Hindi (Devanagari script), and Thai. 
This broad language coverage substantially enhances the applicability of our system to multilingual and globalized document processing scenarios.\nArchitecture PaddleOCR-VL decomposes the complex task of document parsing into two stages. The first stage, PP-DocLayoutV2, is responsible for layout analysis, where it localizes semantic regions and predicts their reading order. Subsequently, the second stage, PaddleOCR-VL-0.9B, leverages these layout predictions to perform fine-grained recognition of diverse content, including text, tables, formulas, and charts. Finally, a lightweight post-processing module aggregates the outputs from both stages and formats the final document into structured Markdown and JSON.\nShow Cases We have provided several comprehensive document parsing examples using PaddleOCR-VL below. For even more examples, please consult our technical report or try the online demo.\nGetting Started Install Dependencies Install PaddlePaddle and PaddleOCR:\npython -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/ python -m pip install -U \u0026#34;paddleocr[doc-parser]\u0026#34; python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl Basic Usage CLI usage:\npaddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/pp_ocr_vl_demo.png Python API usage:\nfrom paddleocr import PaddleOCRVL pipeline = PaddleOCRVL() output = pipeline.predict(\u0026#34;https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/pp_ocr_vl_demo.png\u0026#34;) for res in output: res.print() res.save_to_json(save_path=\u0026#34;output\u0026#34;) res.save_to_markdown(save_path=\u0026#34;output\u0026#34;) Accelerate VLM Inference via vLLM Start the VLM inference server (the default port is 8080): docker run \\ --rm \\ --gpus all \\ --network host \\ ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-vllm-server # 
You can also use ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlex-genai-sglang-server for the SGLang server Call the PaddleOCR CLI or Python API: paddleocr doc_parser \\ -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/pp_ocr_vl_demo.png \\ --vl_rec_backend vllm-server \\ --vl_rec_server_url http://127.0.0.1:8080 from paddleocr import PaddleOCRVL pipeline = PaddleOCRVL(vl_rec_backend=\u0026#34;vllm-server\u0026#34;, vl_rec_server_url=\u0026#34;http://127.0.0.1:8080\u0026#34;) output = pipeline.predict(\u0026#34;https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/pp_ocr_vl_demo.png\u0026#34;) for res in output: res.print() res.save_to_json(save_path=\u0026#34;output\u0026#34;) res.save_to_markdown(save_path=\u0026#34;output\u0026#34;) License The PaddleOCR-VL model is provided under the Apache License 2.0.\nCitation If you find PaddleOCR-VL useful or wish to use it in your projects, please kindly cite our technical report:\n@misc{paddleocrvl2025technicalreport, title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model}, author={Cui, C. et al.}, year={2025}, primaryClass={cs.CL}, howpublished={\\url{https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf}} } ","permalink":"/blog/posts/paddleocr-vl/","summary":"We are excited to release PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing.","title":"PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model"},{"content":"We\u0026rsquo;ve just rolled out a powerful update for the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models that speeds up inference for long-context tasks. 
By integrating a new sparse attention technology, you can now process long documents and conversations much faster, with minimal impact on accuracy.\nThis update is currently available for the \u0026ldquo;-Paddle\u0026rdquo; versions of the ERNIE-4.5-21B-A3B and ERNIE-4.5-300B-A47B models when deployed with FastDeploy.\nWhat You\u0026rsquo;ll Experience: A Major Performance Boost 🚀 With this update, you\u0026rsquo;ll see significant improvements in speed and efficiency. The gains are substantial across both updated models:\nPerformance Gains for ERNIE-4.5-21B-A3B Metric Before (Full Attention) After (Sparse Attention) Improvement Queries Per Second (QPS) 0.101 0.150 +48% Decode Speed (token/s) 13.32 18.12 +36% Time to First Token (s) 8.082 5.466 -48% End-to-End Latency (s) 61.400 42.157 -46% Performance Gains for ERNIE-4.5-300B-A47B Metric Before (Full Attention) After (Sparse Attention) Improvement Queries Per Second (QPS) 0.066 0.081 +23% Decode Speed (token/s) 5.07 6.75 +33% Time to First Token (s) 13.812 10.584 -30% End-to-End Latency (s) 164.704 132.745 -24% Performance evaluated on the longbook_sum_eng subset from InfiniteBench with a mean input length of ~113K tokens.\nIntroducing PLAS This speed-up is powered by PLAS (Pluggable Lightweight Attention for Sparsity), a novel sparse attention mechanism.\nInstead of the traditional attention method that compares every single token in a long text against every other token, PLAS works smarter. It divides the text into blocks and uses a small, learnable module to intelligently select only the most relevant blocks for its calculations.\nThe best part is its \u0026ldquo;pluggable\u0026rdquo; nature. We can add PLAS to a fully trained model without changing the original weights, ensuring that the model\u0026rsquo;s core knowledge remains intact.\nImpact on Accuracy We know that performance can\u0026rsquo;t come at the cost of accuracy. The PLAS method was specifically designed to be nearly lossless. 
Our evaluations on long-context benchmarks show that the difference in precision is negligible for both models.\nModel Benchmark Full Attention Sparse Attention (PLAS) ERNIE-4.5-21B-A3B LongBenchV2 31.48 31.45 Ruler (128K) 25.48 25.05 ERNIE-4.5-300B-A47B LongBenchV2 41.02 41.05 Ruler (128K) 58.18 57.85 Evaluation results show minimal precision changes, ensuring reliable model output.\nHow to Get Started If you\u0026rsquo;re using a -Paddle version of the ERNIE-4.5-21B-A3B or ERNIE-4.5-300B-A47B models with FastDeploy, enabling sparse attention is simple.\nJust set the environment variable and add the PLAS configuration to your launch command.\n# Set the environment variable to enable the PLAS attention backend export FD_ATTENTION_BACKEND=\u0026#34;PLAS_ATTN\u0026#34; # Launch the API server with your model (e.g., ERNIE-4.5-300B-A47B-Paddle) and PLAS configuration python -m fastdeploy.entrypoints.openai.api_server \\ --model baidu/ERNIE-4.5-300B-A47B-Paddle \\ --port 8180 \\ --metrics-port 8181 \\ --quantization wint4 \\ --tensor-parallel-size 4 \\ --engine-worker-queue-port 8182 \\ --max-model-len 131072 \\ --max-num-seqs 32 \\ --max-num-batched-tokens 8192 \\ --enable-chunked-prefill \\ --plas-attention-config \u0026#39;{\u0026#34;plas_encoder_top_k_left\u0026#34;: 50, \u0026#34;plas_encoder_top_k_right\u0026#34;: 60,\u0026#34;plas_decoder_top_k_left\u0026#34;: 100, \u0026#34;plas_decoder_top_k_right\u0026#34;: 120}\u0026#39; Command example from the ERNIE-4.5-300B-A47B-Paddle model card.\nFor more technical details, you can refer to the official PLAS Attention documentation.\n","permalink":"/blog/posts/plas/","summary":"How the new PLAS sparse attention update delivers performance gains for long-context inference on ERNIE 4.5 models.","title":"ERNIE 4.5 Gets a Major Inference Speed Boost"},{"content":"Introduction to FastDeploy 2.0 As large models such as the ERNIE 4.5 family continue to be open-sourced, interest in their inference performance and deployment 
efficiency has multiplied across both research and industry. FastDeploy 2.0, built on the PaddlePaddle framework, addresses this demand by offering an end-to-end toolkit for efficient deployment and high-performance inference of large models. The current release of FastDeploy supports several widely used open-source models and introduces a high-throughput inference architecture based on Expert Parallel (EP) and Prefill-Decode Disaggregation (PD). In benchmark tests with the ERNIE 4.5 model, FastDeploy 2.0 achieves 56K/21K tokens per second in input/output throughput. It also includes a near-lossless 2-bit quantization strategy, enabling models with hundreds of billions of parameters to run on a single GPU. FastDeploy 2.0 is designed to reduce the complexity of deploying large models, improve inference efficiency, and optimize resource utilization, making it easier for researchers and enterprises to bring large-model applications into practical use.\nFastDeploy 2.0 Highlights FastDeploy 2.0 achieves highly efficient inference deployment for large-scale models through the following key innovations:\nUnified interface support: FastDeploy 2.0 is compatible with the OpenAI API protocol and fully aligned with the vLLM interface, supporting both local and service-based inference modes for easy integration. Integrated high-performance optimizations: The toolkit incorporates a range of inference acceleration techniques (including low-bit quantized operators, CUDA Graph, speculative decoding, context caching, segmented prefill, and Prefill/Decode Disaggregation), enabling ERNIE 4.5 models to achieve strong inference throughput. Extensive quantization support: FastDeploy 2.0 supports weight, activation, and KV cache quantization down to 8-bit, 4-bit, and 2-bit levels, allowing models with hundreds of billions of parameters to run on a single GPU with minimal accuracy loss. 
Optimized for heterogeneous hardware: Inference is optimized across a wide range of hardware platforms, including NVIDIA GPUs, KUNLUNXIN P800, Iluvatar BI-V150, Hygon K100AI, and Enflame S60. Production-ready deployment features: For practical industrial scenarios, FastDeploy 2.0 includes traffic scheduling features such as real-time load-aware scheduling and distributed load balancing for scalable and stable deployment. Performance and Benchmark Results In addition to deployment capabilities, FastDeploy 2.0 integrates a range of quantization techniques. With a single command, users can enable weight-only 8-bit, 4-bit, or FP8 online quantized inference. Static W4A8 quantization is also supported. Built on deeply optimized CUTLASS kernels, the system dynamically selects the most suitable quantization strategy based on model architecture and hardware platform.\nFastDeploy 2.0 also introduces a high-performance, modular speculative decoding framework. Performance is further enhanced through kernel-level fusion for pre/post-processing, dynamic batching, parallel verification, and virtual padding to accelerate token validation. The system is fully compatible with context caching, Prefill-Decode Disaggregation, Expert Parallel, and chunked prefill.\nFor MTP (Multi-Token Prediction) inference, FastDeploy supports logical and physical address separation of the KV cache to enable context caching across various layers of both the target model and the MTP module. Prefill-Decode Disaggregation is also applied to MTP communication to reduce overhead and improve end-to-end inference throughput.\nAdditionally, FastDeploy 2.0 provides optimized CUDA Graph support. With PaddlePaddle’s dynamic-to-static conversion, the toolkit enables both static and dynamic graph capture. 
In testing with lightweight ERNIE models, decoding speed was improved by more than 2 times on average.\nSingle-Node Deployment of ERNIE-4.5-300B-A47B FastDeploy supports weight-only 4-bit inference, enabling deployment with as few as 4 GPUs while maintaining near-lossless accuracy. When deploying the ERNIE-4.5-300B-A47B model on a single node, it delivers approximately 23% higher QPS than full-weight 8-bit inference, reducing hardware requirements and improving efficiency at the same time. To further optimize MoE quantization, FastDeploy introduces W4A8 quantization along with 8-bit compression for KV cache, resulting in an additional ~40% QPS gain. On a single H800 node, the achieved TPS outperforms the 8-GPU vLLM deployment of WINT4 DeepSeek by 198%.\nSingle-Node Deployment of Lightweight ERNIE 4.5 Models With an input sequence length of 1.1K, the A3B model on an H800 GPU maintained TPOT within 25 ms, while the 0.3B and 0.6B models on an A30 GPU kept TPOT within 16 ms.\nWhen deploying lightweight Qwen3 models, FastDeploy achieved 5% and 12% higher throughput than vLLM. For lightweight ERNIE 4.5 MoE models, throughput outpaced similar-scale Qwen3 models by 99% and 118%, respectively, highlighting FastDeploy’s efficiency on small-scale MoE deployments.\n2-Bit Quantization PaddlePaddle\u0026rsquo;s 2-bit quantization approach reduces MoE weights from BF16 to 2-bit, significantly lowering memory footprint and deployment resource requirements during inference. For the 300B-parameter ERNIE-4.5 model, this method compresses weight storage from 600 GB to 89 GB, enabling deployment on a single 141 GB NVIDIA GPU.\nThe 2-bit quantization is based on convolutional encoding and inherits ideas from Trellis Code Quantization and Bitshift Trellis, while introducing substantial improvements to the codebook design and encoding algorithm. 
Such enhancements reduce quantization loss and further improve inference efficiency.\nCompared to conventional scalar quantization, PaddlePaddle’s method achieves better accuracy; compared to traditional vector quantization, it delivers faster inference speed. Quantized ERNIE-4.5-300B-A47B models retain near-lossless accuracy across multiple benchmark datasets.\nTest Set IFEval BBH DROP GSM8K CMath CMMLU WINT4 88.17 94.43 91.17 96.21 96.50 89.92 WINT2 85.40 92.02 89.97 95.98 96.00 86.22 Large-Scale Distributed Inference FastDeploy enables large-scale distributed inference for MoE models through Expert Parallelism (EP). It currently supports a fully separated deployment of the ERNIE-4.5-300B-A47B model.\nDispatch and combination operations across devices are handled by FastDeploy’s DeepEP engine, which manages both intra-node and inter-node communication. The low-latency mode of DeepEP has been further enhanced through a two-stage optimization process, resulting in a 2 times improvement in communication performance.\nThe strategy can be summarized as follows: when the expert index maps to a local GPU, data is transferred via NVLink. When the target expert resides on a remote machine, a two-stage transmission is employed—first via RDMA to the corresponding rank on the destination machine, and then via NVLink to the target GPU.\nIn practice, however, implementing low-latency mode poses challenges. To avoid introducing CPU-side synchronization, metadata cannot be sent in advance, resulting in increased complexity at each communication stage. Meanwhile, due to the involvement of intra-GPU, inter-GPU, and inter-node transfers—each with coupled data directions—naive memory ordering can cause performance degradation. Proper handling of memory consistency becomes critical.\nTo address these challenges, FastDeploy implements metadata-free signaling through a kernel-level design using three layers of atomic semaphore mappings across seven signaling types. 
For memory consistency control, FastDeploy employs fine-grained ordering mechanisms across the SM, GPU, and system layers. This ensures accuracy while minimizing performance overhead during multi-stage data transfers.\nTo support efficient KV cache transmission, FastDeploy includes a lightweight, custom-built RDMA-based communication library that requires only a basic RDMA runtime environment. The library supports both NVIDIA GPUs and KUNLUNXIN XPUs, and is designed for ease of deployment.\nUnlike existing solutions, FastDeploy’s implementation introduces several enhancements to improve performance, including a reduced number of CQEs and support for PCIe Relaxed Ordering. In benchmark tests using Mellanox ConnectX-7 400G NICs, both FastDeploy and Mooncake implementations fully utilized bandwidth under multi-threaded load, approaching the theoretical hardware limits. In single-threaded scenarios, FastDeploy achieved a 1.1 to 6.9 times higher throughput compared to Mooncake.\nThe ERNIE-4.5-300B-A47B model, when optimized with W4A8 quantization, KV cache quantization, communication improvements, KV cache transfer optimization, and MTP speculative decoding, reached 50 ms TPOT on H800 hardware for 2K-input / 400-output sequences, with throughput of up to 56K/21K tokens per second. The data represents a 17% improvement in output TPS over the baseline reported in the original ERNIE 4.5 technical report.\nReal-Time Load-Aware and Distributed Load Balancing Scheduling Inference tasks typically involve varying input lengths, with output lengths tied to the specific task, making them difficult to predict in advance. Most open-source scheduling solutions rely on round-robin or input-length-based approaches, which fall short of ensuring global load balancing for dynamic outputs in inference clusters. FastDeploy leverages a dedicated cache server to enable real-time global load awareness and a distributed load-balancing strategy. 
This feature can be activated directly via startup parameters without separate deployment, significantly improving cluster throughput and Time to First Token (TTFT) metrics.\nIn hybrid PD deployment scenarios, the scheduler employs a proactive pull-based approach combined with a Work Stealing strategy for load balancing. Each inference instance maintains a task queue in the cache, allowing idle nodes in the cluster to \u0026ldquo;steal\u0026rdquo; tasks from high-load nodes for processing. This effectively mitigates long-tail load issues, boosting overall service performance. In Prefill-Decode Disaggregation(PD) scenarios, the scheduler operates as a distributed system. Inference instances register their information in the cache and periodically report real-time load to update their status. The scheduler monitors cluster load in real time via the cache. During scheduling, it first identifies a set of low-load nodes based on the input token count of the inference request, then randomly selects a node from this set for task assignment. This approach minimizes load fluctuations caused by multiple schedulers directing requests to the same instance within a single synchronization cycle. Multi-Hardware Inference Deployment Most deployment tools for large models in the open-source community primarily support NVIDIA and AMD GPUs. FastDeploy not only optimizes for NVIDIA GPUs but also comprehensively addresses diverse hardware deployment needs. FastDeploy introduces a hardware adaptation layer that abstracts the underlying compute infrastructure, provides a unified model invocation interface, and efficiently supports various device backends. The layer includes kernel implementations in CUDA, Triton, and C++, among others. 
Currently, it enables high-performance inference on a wide range of hardware, including KUNLUNXIN P800, Iluvatar BI-V150, Hygon K100AI, and Enflame S60 (see examples later for specific usage details).\nGetting Started with FastDeploy 2.0 Local Offline Deployment Example from fastdeploy import LLM, SamplingParams sampling_params = SamplingParams(top_p=0.95) llm = LLM(model=\u0026#34;ERNIE-4.5-0.3B\u0026#34;) outputs = llm.chat(messages=[{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Write me a poem about yourself\u0026#34;}], sampling_params=sampling_params) Service-Based Inference Deployment Example Launching the service with a single command python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-0.3B-Paddle --max-model-len 32768 After launching, request the service using the following commands: curl -X POST \u0026#34;http://0.0.0.0:8180/v1/chat/completions\u0026#34; \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{\u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Write me a poem about yourself\u0026#34;}]}\u0026#39; Once the service is up and running, you can easily integrate with existing tools and workflows using the OpenAI-compatible API provided by FastDeploy 2.0. For detailed documentation, installation guides, and advanced configuration options, please refer to the FastDeploy 2.0 repository and documentation.\nMulti-Hardware Efficient Deployment Example As an example, deploying the ERNIE-4.5-300B-A47B-Paddle model on KUNLUNXIN P800 hardware can be done by following these steps:\n# Step 1. Create and access the container using a precompiled image. 
mkdir Work cd Work docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 docker run --name fastdeploy-xpu --net=host -itd --privileged -v $PWD:/Work -w /Work \\ ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-xpu:2.0.0 \\ /bin/bash docker exec -it fastdeploy-xpu /bin/bash # Step 2. Launch an OpenAI API-compatible service based on the ERNIE-4.5-300B-A47B-Paddle model. export XPU_VISIBLE_DEVICES=\u0026#34;0,1,2,3\u0026#34; # or \u0026#34;4,5,6,7\u0026#34; python -m fastdeploy.entrypoints.openai.api_server \\ --model baidu/ERNIE-4.5-300B-A47B-Paddle \\ --port 8188 \\ --tensor-parallel-size 4 \\ --max-model-len 32768 \\ --max-num-seqs 64 \\ --quantization \u0026#34;wint4\u0026#34; \\ --gpu-memory-utilization 0.9 # Step 3. Send a sample request using curl. curl -X POST \u0026#34;http://0.0.0.0:8188/v1/chat/completions\u0026#34; \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{\u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Where is the capital of China?\u0026#34;}]}\u0026#39; Resources GitHub Repository: FastDeploy ERNIE 4.5 open models ","permalink":"/blog/posts/fastdeploy2.0/","summary":"As large models such as the ERNIE 4.5 family continue to be open-sourced, interest in their inference performance and deployment efficiency has multiplied across both research and industry. FastDeploy 2.0, built on the PaddlePaddle framework, addresses this demand by offering an end-to-end toolkit for efficient deployment and high-performance inference of large models.","title":"FastDeploy 2.0: A Large-Scale Model Inference and Deployment Toolkit with Native Support for ERNIE 4.5 "},{"content":"ERNIE Bot | GitHub | Hugging Face | BAIDU AI Studio | Technical Report\nIntroduction to ERNIE 4.5 We introduce ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. 
The model family consists of Mixture-of-Experts (MoE) models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. For the MoE architecture, we propose a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality. This MoE architecture enhances multimodal understanding without compromising, and in some cases even improving, performance on text-related tasks. All of our models are trained with optimal efficiency using the PaddlePaddle deep learning framework, which also enables high-performance inference and streamlined deployment for them. We achieve 47% Model FLOPs Utilization (MFU) in our largest ERNIE 4.5 language model pre-training. Experimental results show that our models achieve state-of-the-art performance across multiple text and multimodal benchmarks, especially in instruction following, world knowledge memorization, visual understanding and multimodal reasoning. All models are publicly accessible under Apache 2.0 to support future research and development in the field. Additionally, we open source the development toolkits for ERNIE 4.5, featuring industrial-grade capabilities, resource-efficient training and inference workflows, and multi-hardware compatibility.\nERNIE 4.5 Highlights Our model family is characterized by three key innovations:\nMultimodal Heterogeneous MoE Pre-Training: Our models are jointly trained on both textual and visual modalities to better capture the nuances of multimodal information and improve performance on tasks involving text understanding and generation, image understanding, and cross-modal reasoning. To achieve this without one modality hindering the learning of another, we designed a heterogeneous MoE structure, incorporated modality-isolated routing, and employed router orthogonal loss and multimodal token-balanced loss. 
These architectural choices ensure that both modalities are effectively represented, allowing for mutual reinforcement during training.\nScaling-Efficient Infrastructure: We propose a novel heterogeneous hybrid parallelism and hierarchical load balancing strategy for efficient training of ERNIE 4.5 models. By using intra-node expert parallelism, memory-efficient pipeline scheduling, FP8 mixed-precision training and fine-grained recomputation methods, we achieve remarkable pre-training throughput. For inference, we propose a multi-expert parallel collaboration method and a convolutional code quantization algorithm to achieve 4-bit/2-bit lossless quantization. Furthermore, we introduce PD disaggregation with dynamic role switching for effective resource utilization to enhance inference performance for ERNIE 4.5 MoE models. Built on PaddlePaddle, ERNIE 4.5 delivers high-performance inference across a wide range of hardware platforms.\nModality-Specific Post-Training: To meet the diverse requirements of real-world applications, we fine-tuned variants of the pre-trained model for specific modalities. Our LLMs are optimized for general-purpose language understanding and generation. The VLMs focus on vision-language understanding and support both thinking and non-thinking modes. Each model was post-trained with a combination of Supervised Fine-Tuning (SFT) and either Direct Preference Optimization (DPO) or a modified reinforcement learning method named Unified Preference Optimization (UPO).\nPerformance and Benchmark Results ERNIE-4.5-300B-A47B-Base surpasses DeepSeek-V3-671B-A37B-Base on 22 out of 28 benchmarks, demonstrating leading performance across all major capability categories. This underscores the substantial improvements in generalization, reasoning, and knowledge-intensive tasks brought about by scaling up the ERNIE-4.5-Base model relative to other state-of-the-art large models. 
With a total parameter size of 21B (approximately 70% that of Qwen3-30B), ERNIE-4.5-21B-A3B-Base outperforms Qwen3-30B-A3B-Base on several math and reasoning benchmarks, including BBH and CMATH. ERNIE-4.5-21B-A3B-Base remains highly competitive given its significantly smaller model size, demonstrating notable parameter efficiency and favorable performance trade-offs.\nERNIE-4.5-300B-A47B, the post-trained model, demonstrates significant strengths in instruction following and knowledge tasks, as evidenced by its state-of-the-art scores on benchmarks such as IFEval, Multi-IF, SimpleQA, and ChineseSimpleQA. The lightweight model ERNIE-4.5-21B-A3B achieves competitive performance compared to Qwen3-30B-A3B, despite having approximately 30% fewer total parameters.\nIn the non-thinking mode, ERNIE-4.5-VL exhibits outstanding proficiency in visual perception, document and chart understanding, and visual knowledge, performing strongly across a range of established benchmarks. Under the thinking mode, ERNIE-4.5-VL not only demonstrates enhanced reasoning abilities compared to the non-thinking mode, but also retains the strong perception capabilities of the latter. ERNIE-4.5-VL-424B-A47B delivers consistently strong results across the full multimodal evaluation suite. Its thinking mode provides a distinct advantage on reasoning-centric tasks, narrowing the gap to, or even surpassing, OpenAI-o1 on challenging benchmarks such as MathVista, MMMU, and VisualPuzzle, while maintaining competitive performance on perception-focused datasets like CV-Bench and RealWorldQA. The lightweight vision-language model ERNIE-4.5-VL-28B-A3B achieves competitive or even superior performance compared to Qwen2.5-VL-7B and Qwen2.5-VL-32B across most benchmarks, despite using significantly fewer activated parameters. 
Notably, our lightweight model also supports both thinking and non-thinking modes, offering functionalities consistent with ERNIE-4.5-VL-424B-A47B.\nPerformance of ERNIE-4.5 pre-trained models Performance of post-trained model ERNIE-4.5-300B-A47B Performance of post-trained model ERNIE-4.5-21B-A3B Performance of post-trained multimodal models in thinking mode Performance of post-trained multimodal models in non-thinking mode Getting Started with ERNIE 4.5 The ERNIE 4.5 models are trained using the PaddlePaddle framework. The following sections detail tools and resources within the PaddlePaddle ecosystem for fine-tuning and deploying ERNIE 4.5 models.\nFor developers working within the PyTorch ecosystem, ERNIE 4.5 models are also available in PyTorch-compatible formats.\nERNIEKit: Fine-tuning and Alignment ERNIEKit is an industrial-grade development toolkit for ERNIE 4.5. It provides model training and compression capabilities, including pre-training, Supervised Fine-Tuning (SFT), Low-Rank Adaptation (LoRA), Direct Preference Optimization (DPO), Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) techniques.\nUsage Examples:\n# Download model huggingface-cli download baidu/ERNIE-4.5-300B-A47B-Base-Paddle \\ --local-dir baidu/ERNIE-4.5-300B-A47B-Base-Paddle # SFT erniekit train examples/configs/ERNIE-4.5-300B-A47B/sft/run_sft_wint8mix_lora_8k.yaml \\ model_name_or_path=baidu/ERNIE-4.5-300B-A47B-Base-Paddle # DPO erniekit train examples/configs/ERNIE-4.5-300B-A47B/dpo/run_dpo_wint8mix_lora_8k.yaml \\ model_name_or_path=baidu/ERNIE-4.5-300B-A47B-Base-Paddle For more detailed examples, please refer to the ERNIEKit repository.\nFastDeploy: Efficient Model Deployment FastDeploy is an efficient deployment toolkit for large models based on PaddlePaddle. It offers an out-of-the-box, multi-hardware deployment experience with a single line of code, and its API is compatible with both vLLM and OpenAI protocols. 
For deploying the ERNIE 4.5 model, it provides an industrial-grade solution for multi-machine PD Disaggregation with multi-level load balancing, and it supports a wide range of acceleration technologies like low-bit quantization inference, context caching, and speculative decoding.\nLocal Inference Example:\nfrom fastdeploy import LLM, SamplingParams prompt = \u0026#34;Write me a poem about large language model.\u0026#34; sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model=\u0026#34;baidu/ERNIE-4.5-0.3B-Paddle\u0026#34;, max_model_len=32768) outputs = llm.generate(prompt, sampling_params) Service Deployment Example:\npython -m fastdeploy.entrypoints.openai.api_server \\ --model \u0026#34;baidu/ERNIE-4.5-0.3B-Paddle\u0026#34; \\ --max-model-len 32768 \\ --port 9904 Once the service is up and running using FastDeploy, it offers an OpenAI-compatible API, allowing for easy integration with existing tools and workflows.\nFor detailed documentation, installation guides, and advanced configuration options, please refer to FastDeploy repository.\nLicense The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.\nCitation If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:\n@misc{ernie2025technicalreport, title={ERNIE 4.5 Technical Report}, author={Baidu-ERNIE-Team}, year={2025}, primaryClass={cs.CL}, howpublished={\\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}} } ","permalink":"/blog/posts/ernie4.5/","summary":"We introduce ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. 
The model family consists of Mixture-of-Experts (MoE) models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model.","title":"Announcing the Open Source Release of the ERNIE 4.5 Model Family"},{"content":"Welcome to the official ERNIE Blog! We really love building large models!\nServices ERNIE Bot BAIDU AI Studio Qianfan Large Model Platform Resources Hugging Face PaddlePaddle ","permalink":"/blog/about/","summary":"\u003cp\u003eWelcome to the official ERNIE Blog! We really love building large models!\u003c/p\u003e\n\u003ch3 id=\"services\"\u003eServices\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003ch4 id=\"ernie-bot\"\u003e\u003ca href=\"https://ernie.baidu.com/\"\u003eERNIE Bot\u003c/a\u003e\u003c/h4\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003ch4 id=\"baidu-ai-studio\"\u003e\u003ca href=\"https://aistudio.baidu.com/\"\u003eBAIDU AI Studio\u003c/a\u003e\u003c/h4\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003ch4 id=\"qianfan-large-model-platform\"\u003e\u003ca href=\"https://cloud.baidu.com/product-s/qianfan_home\"\u003eQianfan Large Model Platform\u003c/a\u003e\u003c/h4\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"resources\"\u003eResources\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003ch4 id=\"hugging-face\"\u003e\u003ca href=\"https://huggingface.co/baidu\"\u003eHugging Face\u003c/a\u003e\u003c/h4\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003ch4 id=\"paddlepaddle\"\u003e\u003ca href=\"https://www.paddlepaddle.org.cn/\"\u003ePaddlePaddle\u003c/a\u003e\u003c/h4\u003e\n\u003c/li\u003e\n\u003c/ul\u003e","title":"About"},{"content":" ERNIE Team, Baidu. (2025). ERNIE 4.5 Technical Report. Cui, C. et al. (2025). PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model. Cui, C. et al. (2026). PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing. Wang, H. et al. (2026). ERNIE 5.0 Technical Report. 
","permalink":"/blog/publication/","summary":"\u003cul\u003e\n\u003cli\u003eERNIE Team, Baidu. (2025). \u003ca href=\"/blog/publication/ERNIE_Technical_Report.pdf\"\u003eERNIE 4.5 Technical Report\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003eCui, C. et al. (2025). \u003ca href=\"/blog/publication/PaddleOCR-VL_Technical_Report.pdf\"\u003ePaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003eCui, C. et al. (2026). \u003ca href=\"https://arxiv.org/abs/2601.21957\"\u003ePaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing\u003c/a\u003e.\u003c/li\u003e\n\u003cli\u003eWang, H. et al. (2026). \u003ca href=\"https://arxiv.org/abs/2602.04705\"\u003eERNIE 5.0 Technical Report\u003c/a\u003e.\u003c/li\u003e\n\u003c/ul\u003e","title":"Publication"}]