Benchmarking AI inference on CPUs: A transparent blueprint for the enterprise

by Maryam Tahhan, John Harrigan, Anton Ivanov, Paul Power, Luigi Mario Zuccarelli | May 28, 2026 | AI

As enterprises look to optimize the total cost of ownership (TCO) of Large Language Model deployment, utilizing existing enterprise CPU infrastructure alongside GPU resources for specific inference workloads has become a strategic initiative. However, infrastructure teams attempting to validate this face a chaotic benchmarking landscape. Currently, performance evaluations lack any shared industry standard; hardware vendors frequently publish isolated, best-case throughput metrics without transparent, reproducible methodologies.

To build the architectural trust required for production adoption, we must standardize how the ecosystem evaluates CPU inference. Engineering teams need a rigorous, repeatable benchmarking framework that moves past idealized marketing data to accurately account for the realities of unoptimized traffic, variable token distributions, and actual hardware saturation limits.

To establish this standard, Red Hat’s OCTO Emerging Tech and Perf & Scale Engineering teams developed the vLLM CPU Performance Evaluation framework. This automated, open source testing suite uses Ansible and GuideLLM to standardize performance benchmarking, establishing universally verifiable baselines that allow enterprises to confidently evaluate CPU capacity, identify hardware bottlenecks, and validate production optimizations.

Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.

Key takeaways

The need for a standard: Evaluating the feasibility of running specific inference workloads on CPUs is currently hindered by fractured, non-transparent benchmarks that mask real-world production behavior.
The solution: A standardized, open source benchmarking framework (vllm-cpu-perf-eval) built using Ansible and GuideLLM to simulate reproducible, multi-tenant enterprise conditions.
The impact: Provides infrastructure engineers with a uniform testing standard to accurately de-risk CPU adoption by measuring raw baseline capacity, traffic variability, and optimization gains.
The framework: A rigorous 3-phase testing methodology covering baseline capacity, realistic variability analysis, and production optimization validation.

How do we standardize vLLM CPU performance testing?

This section outlines the stages, metrics, and procedures for evaluating the performance of the vLLM framework (CPU Mode) when running Small Language Models (SLMs).

The standard benchmarking utility

We use GuideLLM (v0.6.0) as the performance evaluation tool. GuideLLM is a workload generator and benchmarking platform for evaluating Large Language Models (LLMs). By simulating real-world inference workloads, it lets users assess the performance, resource requirements, and cost implications of deploying LLMs on various hardware configurations.

GuideLLM serves a dual role in our architecture:

Benchmark utility: Provides the framework and metrics for measuring performance
Load generator: Simulates user requests and workloads necessary to test the system under realistic or stress conditions

Models evaluated

We selected a range of Decoder-Only architectures to stress test everything from high-velocity chat to complex Mixture-of-Experts (MoE) scaling.

Architecture Family	Representative Model	Key Application Focus	Rationale
Llama 3	Llama-3.2-1B-Instruct	Prefill-Heavy	Latest Llama architecture; benchmarks high-density input processing.
Llama 2	TinyLlama-1.1B-Chat	Small-Scale Efficiency	Resource-efficient variant for edge/low-core environments.
IBM Granite	granite-3.2-2b-instruct	Enterprise Baseline	Red Hat’s flagship balanced model for production instruction-following.
Qwen 3	Qwen3-0.6B	High-Efficiency	Optimized for maximum throughput with minimal memory footprint.
Transformer MoE	gpt-oss-20b	Scalability Testing	21B (3.6B active) parameters to establish baseline CPU capacity and scaling behavior on 128k context windows.

Standardized workload profiles

To benchmark models uniformly under different types of heavy compute stress, we defined five distinct workload profiles. These profiles are strictly differentiated by their balance of Input Sequence Length (ISL) and Output Sequence Length (OSL). This allows us to benchmark both prefill-heavy (long input context processing) and decode-heavy (long generated output text) scenarios. The workloads are listed in the table below:

Workload Type	ISL:OSL	Description
Chat	512:512	Balanced prefill/decode, typical conversational AI
Summarization	2048:256	Medium context, summarization tasks
Code Generation	1024:1024	Balanced code generation scenarios
RAG	8192:512	Long context prefill, retrieval-augmented generation
Reasoning	256:2048	Long output decode, chain-of-thought reasoning
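As a rough illustration, the five profiles can be held in a small lookup table and ranked by how prefill-heavy they are. This is a minimal sketch: the names WorkloadProfile, PROFILES, and prefill_share are hypothetical and are not part of the framework; only the ISL and OSL values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """One fixed-length workload: input (ISL) and output (OSL) token counts."""
    name: str
    isl: int
    osl: int

    @property
    def total_tokens(self) -> int:
        return self.isl + self.osl

    @property
    def prefill_share(self) -> float:
        """Fraction of the full sequence made up of prompt (prefill) tokens."""
        return self.isl / self.total_tokens


# ISL/OSL values copied from the workload table; the keys are hypothetical.
PROFILES = {
    p.name: p
    for p in (
        WorkloadProfile("chat", 512, 512),
        WorkloadProfile("summarization", 2048, 256),
        WorkloadProfile("code_gen", 1024, 1024),
        WorkloadProfile("rag", 8192, 512),
        WorkloadProfile("reasoning", 256, 2048),
    )
}

if __name__ == "__main__":
    # Print profiles from most prefill-heavy to most decode-heavy.
    ranked = sorted(PROFILES.values(), key=lambda p: p.prefill_share, reverse=True)
    for p in ranked:
        print(f"{p.name:<14} ISL={p.isl:<5} OSL={p.osl:<5} prefill share={p.prefill_share:.2f}")
```

Ranking by prefill share makes the intent of the table explicit: RAG and summarization stress prompt processing, while reasoning stresses token-by-token decoding.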

Measuring the full stack: client, server, and system metrics

To capture the true performance profile, we collect three distinct telemetry streams simultaneously. This allows us to see, for example, how a spike in Request Queue Depth on the server correlates with a degradation in Time to First Token for the end user.

1. Client-side metrics (the user experience)

Time to First Token (TTFT): The perceived responsiveness
Inter-Token Latency (ITL): The time between successive output tokens, which determines how smoothly the response streams to the reader
End-to-End Latency (E2E): Total round-trip time for the complete response
Total Token Throughput: The aggregate number of tokens processed per second across all concurrent requests

2. Server-side metrics (the engine performance)

TTFT, ITL, E2E Latency and Total Token Throughput (from the server POV)
KV Cache utilization: Tracks how efficiently memory is being used for context
Request queue depth: Identifies bottlenecks where the engine is saturated
Token generation rate: The baseline generation capacity of the CPU environment, independent of network overhead
Avg. prompt/generation length: Provides context for throughput fluctuations

3. System-level metrics (infrastructure health)

CPU utilization (%): Monitors core saturation and identifies if we are hitting compute or memory-bandwidth limits
Memory consumption (GB): Total footprint, including the model weights and the dynamic KV Cache buffer
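To make the client-side latency definitions concrete, the sketch below derives TTFT, mean ITL, and E2E latency from the timestamps of a single streamed request. It is an illustration only: the RequestTrace record and the function names are hypothetical, and GuideLLM computes and reports these metrics itself.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps (in seconds) for one streamed request; a hypothetical record."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to First Token: delay between sending and the first token arriving."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean Inter-Token Latency: average gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-End Latency: delay between sending and the last token arriving."""
    return trace.token_times[-1] - trace.sent_at


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft(trace):.3f}s  ITL={mean_itl(trace):.3f}s  E2E={e2e_latency(trace):.3f}s")
```

Measuring the same quantities on both client and server lets the difference between them be attributed to network and queuing overhead rather than to the inference engine.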

What is our 3-phase testing methodology?

The vLLM CPU performance evaluation framework uses a structured 3-phase testing approach to separate baseline performance measurement, realistic variability analysis, and production optimization evaluation. This methodology was designed to be adaptable across test suites but is currently fully implemented only for concurrent load testing.

Key Benefits:

Reproducible baselines: Fixed workloads eliminate variability for apples-to-apples comparisons
Realistic insights: Variable workloads simulate real-world traffic patterns
Production validation: Caching comparison quantifies optimization benefits
Clear progression: Each phase builds on previous results

Stage 1: Baseline tests

Goal: Identify maximum throughput and hardware saturation points
Method: Uses synthetic datasets, fixed Input Sequence Length (ISL) and Output Sequence Length (OSL), and disables prefix caching (vLLM)
Critical workload ratios (ISL:OSL):
- Chat: 512:512 (Conversational AI)
- RAG: 8192:512 (Long Context Retrieval)
- Code: 1024:1024 (Code Generation)
- Summarization: 2048:256 (Document Processing)
- Reasoning: 256:2048 (Chain-of-Thought Reasoning)

Stage 2: Realistic tests

Goal: Break the artificial consistency of fixed-length workloads and measure how sensitive each model is to variance
Method: Injects realistic variance into the input and output token distributions, then measures the P95/P99 spread in batch completion times
Variable examples:
- Chat: Input 512±128, Output 256±64
- CodeGen: Input 512±128, Output 4096±1024
- Summarization: Input 1024±512, Output 256±128
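One way to picture the variable workloads is to draw each request's token counts from a distribution. The sketch below assumes that the ± value is one standard deviation of a normal distribution and that lengths are clamped to at least one token; both are assumptions for illustration, and the function name sample_lengths is hypothetical. The framework's actual sampling is handled by its GuideLLM configuration.

```python
import random


def sample_lengths(mean_tokens: float, stdev_tokens: float, n: int,
                   seed: int = 0, minimum: int = 1) -> list[int]:
    """Draw n token counts from a normal distribution, clamped to `minimum`.

    Assumption: "512±128" means mean 512 and standard deviation 128.
    A fixed seed keeps the sampled workload reproducible between runs.
    """
    rng = random.Random(seed)
    return [max(minimum, round(rng.gauss(mean_tokens, stdev_tokens))) for _ in range(n)]


if __name__ == "__main__":
    # Chat example from the list above: Input 512±128, Output 256±64.
    inputs = sample_lengths(512, 128, n=8, seed=1)
    outputs = sample_lengths(256, 64, n=8, seed=2)
    for isl, osl in zip(inputs, outputs):
        print(f"ISL={isl:<5} OSL={osl}")
```

Seeding the generator matters here: the workload becomes variable across requests while staying identical across benchmark runs, which keeps Phase 2 results comparable.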

Stage 3: Production tests

Goal: Quantify the impact of production-grade optimizations.
Method: Transitions to real-world datasets and enables prefix caching (vLLM) to confirm final latency characteristics.

Benchmark run anatomy: Each run adheres to a 600-second window, comprising a 30s warm-up, a 540s measurement window with strict boundary enforcement, and a 30s cooldown.
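The run anatomy can be sketched as a simple filter over request timestamps. This example assumes that "strict boundary enforcement" means a request counts only when it both starts and completes inside the measurement window; that interpretation, and the function name in_measurement_window, are assumptions made for illustration.

```python
# Durations in seconds, taken from the run anatomy described above.
WARMUP_S = 30
MEASURE_S = 540
COOLDOWN_S = 30
RUN_S = WARMUP_S + MEASURE_S + COOLDOWN_S  # 600-second run


def in_measurement_window(start: float, end: float, run_start: float = 0.0) -> bool:
    """Return True if a request lies entirely inside the measurement window.

    Assumption: requests overlapping warm-up or cooldown are discarded.
    """
    window_open = run_start + WARMUP_S
    window_close = window_open + MEASURE_S
    return start >= window_open and end <= window_close


if __name__ == "__main__":
    requests = [(10.0, 35.0), (30.0, 100.0), (560.0, 575.0)]
    kept = [r for r in requests if in_measurement_window(*r)]
    print(f"kept {len(kept)} of {len(requests)} requests: {kept}")
```

Discarding requests that straddle a boundary avoids counting work done while caches were still warming or while load was already draining.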

Why is an isolated architecture critical for standardized testing?

To guarantee the integrity of our performance data, we adhere to a “Strict Isolation of Concerns” principle. We ensure that the load generation process never competes for the same compute or memory resources as the inference engine. Any shared overhead would skew the CPU-bound LLM metrics we aim to measure.

Our framework supports two primary deployment modes to achieve this isolation:

Mode 1: Physical node separation (ideal)

The gold standard for benchmarking. We utilize a dual-node setup connected by a high-speed network segment.

Load Generator Node: A dedicated system running the GuideLLM benchmarking suite
Device Under Test (DUT): An isolated system running vLLM on Intel Xeon or AMD EPYC platforms. This eliminates any possibility of “noisy neighbor” effects at the hardware level.

Mode 2: Intra-node socket isolation (logical separation)

When separate physical nodes are not available, we utilize socket-level isolation. In this mode, we leverage the multi-socket architecture of modern enterprise servers (e.g., AWS c8i.metal-48xl or m8a.metal-48xl instances):

Socket 0: Dedicated strictly to GuideLLM and system overhead
Socket 1 (The DUT): Dedicated exclusively to the vLLM container(s)

Ansible tasks guarantee core pinning and NUMA-aware memory allocation, ensuring the load generator and inference engine operate on independent L3 caches and memory controllers.
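The idea behind socket isolation can be checked with a few lines of code. Linux exposes each NUMA node's CPUs as a compact list in /sys/devices/system/node/nodeN/cpulist; the sketch below parses that format and confirms that two core sets do not overlap. The example strings are hypothetical, and the framework itself performs pinning through its Ansible tasks rather than through this code.

```python
def parse_cpulist(text: str) -> set[int]:
    """Expand a Linux cpulist string such as "0-3,8,10-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


if __name__ == "__main__":
    # Hypothetical layouts for a two-socket host with SMT enabled.
    load_generator_cpus = parse_cpulist("0-47,96-143")   # socket 0: GuideLLM
    dut_cpus = parse_cpulist("48-95,144-191")            # socket 1: vLLM
    assert load_generator_cpus.isdisjoint(dut_cpus), "core sets overlap"
    print(f"load generator: {len(load_generator_cpus)} CPUs, DUT: {len(dut_cpus)} CPUs")
```

A disjointness check like this is a cheap sanity test before a run: if the two sets share any core, the load generator and the inference engine can contend for the same resources.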

Memory efficiency: standardizing the KV Cache footprint

After model weights, the Key-Value (KV) Cache is the most significant memory consumer in an inference pipeline. If undersized, the system faces out-of-memory (OOM) crashes; if oversized, it wastes expensive enterprise resources. To achieve the “Enterprise AI Reality,” we move away from guesswork toward a right-sized cache strategy.

We provide reproducibility and efficiency by employing a transparent mathematical approach to calculate the exact footprint required:

Total Elements = 2 x layers x total_tokens x num_kv_heads x head_size

Total Bytes = Total Elements x dtype_size

(Note: For bfloat16, dtype_size is 2 bytes)
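The formula translates directly into code. In the sketch below, the leading factor of 2 accounts for storing both a key and a value tensor per layer. The function name kv_cache_bytes is hypothetical, and the shape values in the example (16 layers, 8 KV heads, head size 64) are assumed to match the published Llama-3.2-1B configuration; verify them against the model's config before relying on the result.

```python
def kv_cache_bytes(layers: int, total_tokens: int, num_kv_heads: int,
                   head_size: int, dtype_size: int = 2) -> int:
    """KV cache size in bytes for one request.

    total_elements = 2 (key + value) x layers x total_tokens x num_kv_heads x head_size
    dtype_size defaults to 2 bytes (bfloat16).
    """
    total_elements = 2 * layers * total_tokens * num_kv_heads * head_size
    return total_elements * dtype_size


if __name__ == "__main__":
    # Assumed shape for Llama-3.2-1B; 1024 tokens = 512 ISL + 512 OSL.
    size = kv_cache_bytes(layers=16, total_tokens=1024, num_kv_heads=8, head_size=64)
    print(f"{size} bytes = {size / 2**20:.0f} MiB = {size / 2**30:.5f} GiB per request")
```

Because every term is a multiplier, the footprint grows linearly with sequence length, which is why long-context workloads such as RAG dominate cache sizing.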

Our sizing strategy

Our methodology ensures stability during traffic spikes without bloated over-provisioning:

Base requirement: Calculate the per-request size based on the specific model architecture.
Linear scaling: Multiply by the planned concurrency (our standard test suite uses up to 32 simultaneous requests).
The safety margin: Apply a 1.25x (25%) buffer. This protection layer ensures that variable token distributions don’t trigger immediate failure.

Example Calculation: Llama-3.2-1B Chat

For a 1024-token workload (512 ISL : 512 OSL):

Base: 0.0312 GiB (32 MiB) per request.
Scale: x 32 concurrent requests.
Protect: x 1.25 safety margin.
Result: 1.25 GiB required -> Configured at 2 GiB for optimal system alignment.
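The three sizing steps reduce to a short calculation. This sketch interprets the figures as binary units (GiB) and assumes that the configured value is the requirement rounded up to the next whole GiB, which matches the 2 GiB setting in the example; the function names are hypothetical.

```python
import math


def required_kv_cache_gib(per_request_gib: float, concurrency: int = 32,
                          safety_margin: float = 1.25) -> float:
    """Base requirement x planned concurrency x safety margin."""
    return per_request_gib * concurrency * safety_margin


def configured_kv_cache_gib(required_gib: float) -> int:
    """Assumption: round the requirement up to the next whole GiB."""
    return math.ceil(required_gib)


if __name__ == "__main__":
    per_request = 0.03125  # 32 MiB, shown rounded as 0.0312 in the example above
    required = required_kv_cache_gib(per_request)
    print(f"required: {required:.2f} GiB, configured: {configured_kv_cache_gib(required)} GiB")
```

Keeping the margin as an explicit parameter makes it easy to re-run the sizing for a different concurrency target or a more conservative buffer.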

This right-sized approach allows enterprises to maximize density on CPU platforms, running more models per socket while maintaining a 100% reliability target.

Grading the stability and repeatability of benchmark results

A benchmark is useless if running it twice yields entirely different scores. To ensure our framework delivers scientifically rigorous repeatability, we don’t just report the mean; we grade the stability of every benchmark run using the Coefficient of Variation (CV), the ratio of a metric's standard deviation to its mean, expressed as a percentage.

The reliability grading scale

This scale allows engineers to quickly determine if a hardware configuration or optimization is “production-ready” or suffering from environmental noise.

CV Range	Repeatability Grade
< 1.0%	Excellent (A+)
1.0% – 3.0%	Good (A to B+)
3.0% – 5.0%	Acceptable (B)
> 5.0%	Poor (C)
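The grading scale can be applied mechanically. The sketch below computes the CV as the sample standard deviation divided by the mean and maps it to the grades in the table. Two choices here are assumptions: the use of the sample (rather than population) standard deviation, and assigning boundary values such as exactly 3.0% to the better grade, since the table's ranges share their endpoints. The function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """CV as a percentage: sample standard deviation divided by the mean."""
    return stdev(samples) / mean(samples) * 100.0


def repeatability_grade(cv_percent: float) -> str:
    """Map a CV percentage to the repeatability grades in the table above."""
    if cv_percent < 1.0:
        return "Excellent (A+)"
    if cv_percent <= 3.0:   # assumption: 3.0% itself still counts as Good
        return "Good (A to B+)"
    if cv_percent <= 5.0:   # assumption: 5.0% itself still counts as Acceptable
        return "Acceptable (B)"
    return "Poor (C)"


if __name__ == "__main__":
    # Hypothetical throughput readings (tokens/s) for the same configuration.
    readings = [100.0, 101.0, 99.0, 100.0, 100.0]
    cv = coefficient_of_variation(readings)
    print(f"CV = {cv:.2f}% -> {repeatability_grade(cv)}")
```

Because the CV is normalized by the mean, the same thresholds apply equally to latency in milliseconds and throughput in tokens per second.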

Establishing an open standard

This methodology offers a standardized, technical foundation for evaluating LLM inference on CPU platforms. By moving away from marketing claims toward a transparent, 3-phase framework, we provide enterprises with a verifiable path to production.

Performance is no longer a black box; instead, it has become a predictable, measurable resource.

Join the community: We invite you to review our automated testing repository and contribute to the ongoing evolution of CPU inference at github.com/redhat-et/vllm-cpu-perf-eval.
