Expanding AI beyond text: an introduction to vLLM-Omni

by Ricardo Noriega De Soto | Jun 30, 2026 | AI

AI models no longer just read and write text. The latest generation of open source models—Qwen3-Omni, Hunyuan-Image3, MiMo-Audio—can reason across text, images, audio, and video simultaneously, generating rich multimodal outputs from a single prompt. But the infrastructure to serve these models in production has not kept pace. vLLM-Omni is an open source framework for multimodal models that closes that gap, extending the widely adopted vLLM inference engine to support omni-modality model serving with the performance characteristics that production deployments require.

Why is it so hard to serve multimodal AI models?

vLLM has become the de facto standard for serving large language models. Its PagedAttention memory management, continuous batching, and OpenAI-compatible API make it the go-to choice for deploying text-based LLMs in production. But vLLM was designed for a single paradigm: autoregressive text generation. One model, one engine, one output modality.

Today’s omni-modality models break all three of those assumptions.

Consider what happens when a user asks Qwen3-Omni to “explain this diagram and read your answer aloud.” The model must run a vision encoder to understand the image, an autoregressive LLM (the “thinker”) to reason about the content and generate text, a second autoregressive model (the “talker”) to produce speech codec tokens, and a waveform decoder (code2wav) to convert those tokens into audible speech. That is four distinct model components with different architectures, compute profiles, and memory requirements, all orchestrated to serve a single user request.

Existing serving systems force developers to handle this complexity manually. You deploy each component as a separate service, write custom glue code for data routing between them, manage GPU allocation by hand, and accept the latency penalties of sequential execution. The result is fragile pipelines that are difficult to scale and impossible to optimize holistically.

Three specific gaps make this untenable at scale:

No support for non-autoregressive generation

Diffusion Transformers (DiT), the architecture behind image, video, and some audio generation, have fundamentally different execution patterns from autoregressive LLMs. They run iterative denoising loops rather than token-by-token generation, require different schedulers, different memory management, different batching strategies, and different parallelism approaches. Existing LLM serving engines simply cannot run them.

No pipeline orchestration

Omni-modality models are not monolithic. They are heterogeneous pipelines where the output of one stage feeds the input of the next. A serving system needs to decompose a model into stages, allocate resources per stage, and overlap execution to hide latency. None of the current text-focused serving engines provide this abstraction.

No unified API surface

Users expect a single endpoint that accepts a prompt and returns text, audio, images, or video depending on the model’s capabilities. Building this from individual microservices means reimplementing protocol handling, streaming, and error propagation for every deployment.

What vLLM-Omni does differently

vLLM-Omni is a framework-level extension of vLLM for multimodal models, not a wrapper. It introduces three architectural primitives that address the gaps above: a multistage pipeline abstraction, a dual-engine runtime, and a pluggable inter-stage communication layer.

Multistage pipeline decomposition

The core abstraction is OmniStage. A complex model is decomposed into a directed graph of stages, each running as an independent process with its own engine, scheduler, and GPU memory budget. The decomposition is declarative: defined in a YAML configuration file that specifies stage topology, model architectures, device assignments, and data flow.
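To make the idea of a declarative stage graph concrete, here is a minimal Python sketch. It is not the vLLM-Omni configuration schema: StageSpec, its field names, and validate_and_order are hypothetical names chosen for illustration. The sketch only shows the kind of checks a stage topology implies: every input must refer to a known stage, the per-device memory budgets must fit, and the stages must form an acyclic graph.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class StageSpec:
    """Hypothetical description of one stage; not the real vLLM-Omni schema."""
    name: str
    engine: str                 # e.g. "ar" or "diffusion"
    device: int                 # GPU index the stage runs on
    gpu_memory_fraction: float  # share of that GPU's memory
    inputs: list = field(default_factory=list)  # names of upstream stages


def validate_and_order(stages):
    """Check the graph and return stage names in dependency order."""
    by_name = {s.name: s for s in stages}
    for s in stages:
        for upstream in s.inputs:
            if upstream not in by_name:
                raise ValueError(f"{s.name} reads from unknown stage {upstream}")

    # Stages sharing a device must not claim more than the whole GPU.
    usage = {}
    for s in stages:
        usage[s.device] = usage.get(s.device, 0.0) + s.gpu_memory_fraction
    for device, total in usage.items():
        if total > 1.0 + 1e-9:
            raise ValueError(f"device {device} is over budget: {total:.2f}")

    # TopologicalSorter raises CycleError (a ValueError) if the graph has a cycle.
    graph = {s.name: set(s.inputs) for s in stages}
    return list(TopologicalSorter(graph).static_order())


# Two stages sharing GPU 0, each with a 30% memory budget.
stages = [
    StageSpec("code2wav", "ar", device=0, gpu_memory_fraction=0.3, inputs=["talker"]),
    StageSpec("talker", "ar", device=0, gpu_memory_fraction=0.3),
]
order = validate_and_order(stages)
print(order)
```

Keeping the topology as data rather than code means the same validation can run before any GPU memory is allocated.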

For example, Qwen3-TTS decomposes into two stages: a Talker (autoregressive LLM generating speech codec tokens on 30% of GPU memory) connected via shared memory to a Code2Wav decoder (waveform generator using another 30% of the same GPU). The stages run concurrently. As the Talker produces codec token chunks, the Code2Wav stage begins generating audio immediately, enabling streaming speech output without waiting for the full token sequence.
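The streaming overlap described above can be sketched with a bounded queue between a producer and a consumer. This is a toy illustration under stated assumptions, not vLLM-Omni code. Threads stand in for the independent stage processes, integers stand in for codec tokens, and the function names are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the token stream


def talker(tokens, chunk_size, out_q):
    """Stand-in producer: emits 'codec tokens' in fixed-size chunks."""
    chunk = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == chunk_size:
            out_q.put(chunk)  # blocks if the consumer falls behind
            chunk = []
    if chunk:
        out_q.put(chunk)      # flush the final partial chunk
    out_q.put(SENTINEL)


def code2wav(in_q, audio_out):
    """Stand-in consumer: converts each chunk as soon as it arrives."""
    while True:
        chunk = in_q.get()
        if chunk is SENTINEL:
            break
        audio_out.append([t * 0.5 for t in chunk])  # fake 'samples'


def run_pipeline(tokens, chunk_size=4):
    q = queue.Queue(maxsize=2)  # small buffer gives backpressure
    audio = []
    producer = threading.Thread(target=talker, args=(tokens, chunk_size, q))
    consumer = threading.Thread(target=code2wav, args=(q, audio))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return audio


audio_chunks = run_pipeline(list(range(10)), chunk_size=4)
print([len(c) for c in audio_chunks])
```

The consumer starts work after the first chunk rather than after the full sequence, which is the property that enables streaming speech output.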

This decomposition is not limited to two stages. Dynin-Omni runs a three-stage pipeline (text generation, image generation, audio generation) from a single prompt. Hunyuan-Image3 splits across eight GPUs: four running an autoregressive model for visual token generation with tensor parallelism, and four running a Diffusion Transformer for image denoising—with KV cache transferred directly between them.

The project currently ships 18 stage configurations covering models from Qwen3-Omni to MammothModa2, with single-stage, two-stage, and three-stage topologies supporting text, image, video, and audio outputs.

Dual-engine architecture

vLLM-Omni runs two distinct engine types within the same framework. The AutoRegressive (AR) Engine reuses upstream vLLM’s engine directly, inheriting its PagedAttention KV cache management, continuous batching, and efficient GPU memory utilization for autoregressive text and token generation tasks.

The Diffusion Engine is purpose-built for non-autoregressive generation. It has its own scheduler (supporting both per-request and per-step scheduling), its own caching system (TeaCache for adaptive step-skipping, Cache-DiT for DiT-specific caching), and its own parallelism primitives. Those primitives let it split a denoising workload across multiple GPUs in several ways, much like a specialized factory assembly line, to shorten generation time.
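As a rough intuition for adaptive step-skipping, the sketch below reuses the previous step's model output whenever the conditioning signal has changed very little since the last full computation. It is a simplified illustration in the spirit of this kind of caching, not the TeaCache algorithm as implemented in vLLM-Omni. The scalar signal, the threshold, and the function names are all assumptions.

```python
def denoise_with_step_skipping(x, step_signals, model_fn, threshold):
    """Run a denoising loop, skipping the model call when inputs barely change.

    x            -- current latent (a float here, a tensor in practice)
    step_signals -- one scalar per step summarizing that step's conditioning
    model_fn     -- expensive call returning the residual to add to x
    threshold    -- accumulated relative change that forces a recompute
    """
    cached_residual = None
    prev_signal = None
    accumulated = 0.0
    computed_steps = 0

    for signal in step_signals:
        if prev_signal is not None:
            # Relative change since the previous step, accumulated over skips.
            accumulated += abs(signal - prev_signal) / (abs(prev_signal) + 1e-8)

        if cached_residual is None or accumulated >= threshold:
            cached_residual = model_fn(x, signal)  # full (expensive) step
            accumulated = 0.0
            computed_steps += 1

        # On a skipped step, cached_residual is the last computed residual.
        x = x + cached_residual
        prev_signal = signal

    return x, computed_steps


signals = [1.0, 1.01, 1.02, 1.5, 1.51]
toy_model = lambda latent, signal: -0.1 * latent
_, steps_run = denoise_with_step_skipping(1.0, signals, toy_model, threshold=0.1)
print(steps_run, "of", len(signals), "steps ran the model")
```

The threshold trades quality for speed: a value of 0 recomputes every step, while larger values skip more aggressively.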

The diffusion engine supports over 30 model implementations spanning image generation (Flux, Stable Diffusion 3, GLM-Image, Hunyuan-Image3), video generation (Hunyuan Video, WAN2.2, LTX2), and audio generation (CosyVoice3, OmniVoice, Stable Audio).

Pluggable inter-stage connectors

Data routing between stages is handled by OmniConnector, an extensible abstraction layer with multiple backends:

SharedMemoryConnector for single-node deployments, using IPC with codec-aware streaming (frame-aligned transport with configurable chunk sizes)
MooncakeConnector and YuanrongConnector for multinode deployments, supporting RDMA-capable network transfers
KV Transfer Manager for direct KV cache transfer between autoregressive and diffusion stages, avoiding redundant recomputation
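A pluggable connector layer generally means a small interface with interchangeable backends. The sketch below shows that shape together with frame-aligned chunking. The class and function names are hypothetical, and the in-memory backend is only a stand-in for the shared-memory or RDMA transports listed above.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class Connector(ABC):
    """Hypothetical minimal interface; the real OmniConnector API may differ."""

    @abstractmethod
    def put(self, payload: bytes) -> None: ...

    @abstractmethod
    def get(self) -> Optional[bytes]: ...


class InMemoryConnector(Connector):
    """Toy backend standing in for a shared-memory or network transport."""

    def __init__(self):
        self._buffer = deque()

    def put(self, payload: bytes) -> None:
        self._buffer.append(payload)

    def get(self) -> Optional[bytes]:
        return self._buffer.popleft() if self._buffer else None


def send_frame_aligned(connector, data, frame_bytes, frames_per_chunk):
    """Send data in chunks that never split a codec frame across two messages."""
    if len(data) % frame_bytes != 0:
        raise ValueError("payload is not a whole number of frames")
    chunk_bytes = frame_bytes * frames_per_chunk
    sent = 0
    for start in range(0, len(data), chunk_bytes):
        connector.put(data[start:start + chunk_bytes])
        sent += 1
    return sent


conn = InMemoryConnector()
chunks_sent = send_frame_aligned(conn, bytes(range(10)), frame_bytes=2, frames_per_chunk=2)
first = conn.get()
print(chunks_sent, first)
```

Because the sender depends only on the Connector interface, swapping the transport does not change the stage code.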

Performance

The architecture delivers measurable improvements over baseline approaches. The paper that introduces vLLM-Omni (arXiv:2602.02204) reports that it reduces job completion time by up to 91.4% compared to sequential execution baselines. The gains come from three sources: pipelined execution that overlaps stages (different stages process different requests concurrently), per-stage batching (each stage batches requests independently based on its own compute characteristics), and dynamic GPU resource allocation across stages.
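A back-of-the-envelope model shows why overlapping stages shortens completion time. The stage durations below are invented for illustration and are not taken from the paper. The pipelined formula assumes an idealized pipeline with no transfer overhead, no batching, and one request per stage at a time.

```python
def sequential_time(stage_seconds, num_requests):
    """Each request runs through every stage before the next one starts."""
    return num_requests * sum(stage_seconds)


def pipelined_time(stage_seconds, num_requests):
    """Ideal pipeline: fill it once, then the slowest stage sets the pace."""
    return sum(stage_seconds) + (num_requests - 1) * max(stage_seconds)


# Illustrative numbers only: a three-stage model and ten queued requests.
stage_seconds = [2.0, 1.0, 1.0]
seq = sequential_time(stage_seconds, 10)   # 10 * 4.0 = 40.0 seconds
pipe = pipelined_time(stage_seconds, 10)   # 4.0 + 9 * 2.0 = 22.0 seconds
reduction = 1 - pipe / seq                 # about 0.45

print(f"sequential={seq}s pipelined={pipe}s reduction={reduction:.0%}")
```

In this model the slowest stage bounds throughput, which is one reason per-stage batching and per-stage resource allocation matter alongside pipelining.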

For autoregressive stages, vLLM-Omni inherits vLLM’s state-of-the-art throughput. For diffusion stages, the caching and parallelism optimizations (TeaCache, CFG parallelism, sequence parallelism) provide additional acceleration over naive implementations.
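CFG parallelism rests on the fact that classifier-free guidance needs two independent forward passes per step, one conditioned on the prompt and one unconditioned, which are then blended. The sketch below runs the two branches concurrently and combines them with the standard guidance formula. Threads stand in for separate GPUs, the toy model is a placeholder, and nothing here reflects how vLLM-Omni implements the technique internally.

```python
from concurrent.futures import ThreadPoolExecutor


def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance blend, element by element."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def guided_prediction(model_fn, latents, prompt, guidance_scale, pool):
    # The two branches share no data, so they can run at the same time.
    cond_future = pool.submit(model_fn, latents, prompt)
    uncond_future = pool.submit(model_fn, latents, None)
    return cfg_combine(cond_future.result(), uncond_future.result(), guidance_scale)


def toy_model(latents, prompt):
    """Placeholder for a denoiser: shifts the latents when a prompt is given."""
    shift = 1.0 if prompt is not None else 0.0
    return [x + shift for x in latents]


with ThreadPoolExecutor(max_workers=2) as pool:
    guided = guided_prediction(toy_model, [0.0, 2.0], "a red bicycle", 3.0, pool)

print(guided)
```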

Belonging to the vLLM and PyTorch ecosystem

vLLM-Omni is not a standalone project. It is part of the vLLM ecosystem and the PyTorch Foundation. This positioning matters for three practical reasons:

Upstream alignment: vLLM-Omni rebases regularly onto upstream vLLM releases (currently aligned with vLLM v0.20.0). AR model support, memory management improvements, and API changes flow downstream automatically. The project does not fork vLLM; it extends it.

Hugging Face integration: Models load directly from Hugging Face with the same interface developers already know. If you have served a model with vLLM before, serving an omni-modality model with vLLM-Omni follows the same patterns: vllm-omni serve for online serving, or the Python Omni class for offline inference.

PyTorch-native: The diffusion engine uses standard PyTorch operations, supports torch.compile integration, and works with PyTorch’s distributed primitives. This means the full PyTorch optimization ecosystem—quantization via GPTQ/AWQ/FP8, custom attention backends, hardware-specific kernels—applies to vLLM-Omni without modification.

The hardware support reflects this ecosystem alignment. vLLM-Omni runs on NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Ascend NPUs, Intel XPUs, and Moore Threads (MUSA) GPUs. Each platform has a plugin interface that provides platform-specific worker classes, attention backends, and optimized operations while sharing the same high-level framework code.

The serving API

vLLM-Omni exposes a full OpenAI-compatible API that covers all output modalities through familiar endpoints:

/v1/chat/completions: text and multimodal chat with omni-aware multistage support
/v1/audio/speech: text-to-speech generation (with batch and streaming variants)
/v1/images/generations: DALL-E-compatible text-to-image generation
/v1/images/edits: image-to-image editing with prompts, masks, and reference images
WebSocket endpoints for streaming speech and video generation

The API includes voice management (/v1/audio/voices) with upload support for voice cloning, and an asynchronous job system for long-running video generation tasks. A pure diffusion mode enables lightweight deployment when only image or video generation is needed, skipping the full LLM initialization.
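Because the endpoints follow OpenAI conventions, a client can be a plain HTTP POST. The sketch below builds, but does not send, a request for /v1/images/generations using only the standard library. The base URL, the model name, and the assumption that the server honors the OpenAI-style n and size fields are placeholders to check against the vLLM-Omni documentation.

```python
import json
import urllib.request


def build_image_request(base_url, model, prompt, size="1024x1024", n=1):
    """Build a POST request in the OpenAI images format (not sent here)."""
    payload = {"model": model, "prompt": prompt, "n": n, "size": size}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and model name; substitute the values for your deployment.
req = build_image_request(
    base_url="http://localhost:8000",
    model="your-image-model",
    prompt="a watercolor lighthouse at dawn",
)
print(req.get_method(), req.full_url)

# To send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         result = json.load(resp)
```

An OpenAI-compatible client library pointed at the same base URL should work as well, to the extent the server matches the OpenAI request and response shapes.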

What comes next

The project is moving fast. Over the next couple of months, the Q2 2026 vLLM-Omni roadmap targets expanded model coverage across omni, TTS, and diffusion families, including world models and VLA (vision-language-action) models for robotics. A large-scale disaggregated serving RFC is underway to enable independent scaling of encoder, prefill, decode, and generation stages across node clusters. Deeper optimizations are landing: FP8/INT4 quantization, KV cache CPU offloading, prefix caching, diffusion continuous batching, and a Diffusers backend for broader model compatibility. Hardware-specific roadmaps for ROCm, XPU, NPU, and MUSA are driving platform parity.

With 12 releases shipped since November 2025, 4,600 GitHub stars, and a growing contributor base, vLLM-Omni is building the serving layer that the next generation of AI models requires.

This is the first in a series of posts about vLLM-Omni and multimodality from the Model Architectures team at Red Hat Emerging Technologies. Our team is heavily involved in the project, and in upcoming posts we will go deeper into specific topics: the diffusion engine internals, multistage pipeline performance, hardware platform support, and practical deployment patterns. If you want to get started now, vLLM-Omni documentation covers installation and quickstart guides, and the community is active in #sig-omni on Slack and the vLLM user forum. The code is Apache 2.0 licensed and lives at github.com/vllm-project/vllm-omni.
