Triton is a domain-specific language and compiler for writing high-performance GPU kernels in Python. It offers fine-grained control over memory and parallelism, making it ideal for custom, architecture-optimized compute in machine language and high-performance computing workloads. However, Triton relies on just-in-time (JIT) compilation, which can introduce latency during the first execution of each GPU kernel, especially in production or multi-environment deployments.
To mitigate this, we’re introducing OCI image support for model kernel caches, through a new utility called Model Cache Vault (MCV). MCV packages compile kernel caches into OCI-compliant images that can be verified using Sigstore Cosign with cryptographic signatures. The combination of MCV and Cosign produces trusted, portable artifacts that can be shared across environments, reducing cold-start overhead and aligning with modern DevSecOps practices.
Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.
This is the first installment in a three-week deep dive into Triton kernel trust and performance. In this post, we’ll cover the motivation, mechanics, and integration of this feature into your workflow.
Please note: MCV was previously known as TCV (Triton Cache Vault); its scope has since been expanded to include vLLM. You might still find references to TCV within this article. This article focuses on the vLLM use case that leverages Triton.
Why does Triton JIT need help?
Triton’s JIT model is essential for performance. It compiles kernels on-the-fly, tuned to the local hardware and workload. However, this compilation occurs at runtime and per process, leading to duplicate work when running the same kernels across different environments. While this isn’t a major issue during development, it becomes costly in production, especially when:
- Containers spin up repeatedly in autoscaling clusters
- CI jobs rebuild kernels unnecessarily
- Cold-start latency affects user-facing services
To address this, model caches can now be exported, shared, and reused via OCI images, transforming what was once a runtime-only optimization into a portable, infrastructure-level performance enhancement. Why take this approach instead of ahead-of-time (AOT) compilation? One reason is flexibility: even if a prebuilt image for the target hardware isn’t available, deployment can still proceed with a JIT compilation cost. In contrast, AOT would simply fail in that scenario.
Enter Model Cache Vault (MCV): Cache packaging
To make this possible, we’ve built MCV, a utility for packaging Triton’s JIT-compiled kernel or vLLM caches into OCI container images
These images can then be signed using Sigstore Cosign to drive greater trust and traceability. With MCV, cache management becomes just another part of your container workflow:
- Compile Model kernels to generate a local cache
- Package the cache into an OCI image using MCV (can also be integrated into CI pipelines)
- Sign it using Sigstore Cosign for provenance and integrity
- Push to a registry of your choice
- Verify Signature, pull and extract wherever the Kernel is needed
This enables cache reuse across machines, environments, and pipelines, reducing cold-start overhead and increasing reliability.
What’s in a Triton cache image?
A Triton cache image contains the compiled GPU kernels for one or more Triton programs, targeting specific architectures. By storing these as OCI images, we gain:
- Portability: Share across machines, users, or environments
- Efficiency: Skip compilation and run instantly
- Security: Sign and verify with Cosign
- Compatibility: Use with any OCI-compatible registry or tool (Docker, Podman, Kubernetes, etc.)
It’s just another container image, but it contains your precious, pre-tuned performance artifacts.
Enhancing cache security with Sigstore Cosign
Cache distribution introduces a new challenge: trust. If you’re loading precompiled GPU binaries, how do you know they’re safe? By using Sigstore Cosign to enable cryptographic signing of cache images with signatures, you can:
- Verify the origin of the image before loading it
- Detect tampering or unauthorized changes
- Enforce policy-based validation in your CI/CD or runtime environment
This aligns Triton cache distribution with modern DevSecOps and supply chain security best practices, making it safer to cache once and run anywhere.
How does MCV work?
Creating a Triton/vLLM Cache Image
Below is an example of using MCV to create an OCI Image from local source code:
$ ./_output/bin/linux_amd64/mcv -c -i quay.io/mtahhan/01-vector-add-cache -d example/01-vector-add-cache-rocm INFO[2025-05-27 11:32:48] baremetalFlag false INFO[2025-05-27 11:32:48] Using buildah to build the image INFO[2025-05-27 11:32:49] Wrote manifest to /tmp/buildah-manifest-dir-2184368335/manifest.json INFO[2025-05-27 11:32:49] Image built! 4a600c3c76c658fe1d6f960fbba648a294df3743c8711a1dd6acea7ea047d75f INFO[2025-05-27 11:32:49] Temporary directories successfully deleted. INFO[2025-05-27 11:32:49] OCI image created successfully.
Once the image is created, verify the image was created successfully. The output logs above shows buildah was used to build the image, so that should be used to list the images. buildah is preferred and will be used if detected, but docker and podman can also be used.
$ buildah images REPOSITORY TAG IMAGE ID CREATED SIZE quay.io/mtahhan/01-vector-add-cache latest beb4781fc7d1 About a minute ago 84.1 KB
Extracting a Triton Cache Image
Below is an example of trying to extract a CUDA kernel on a platform with only a ROCm GPU. MCV begins by collecting information about the system’s GPUs using tools like rocm-smi or amd-smi (and nvml for NVIDIA GPUs). Next, it attempts to pull the container image and compares the backend specified in the image labels with the GPU details it detected on the host system. In the example below, this check fails because the image backend is set to “cuda”, while the system has an AMD GPU.
$ ./_output/bin/linux_amd64/mcv -e -i quay.io/mtahhan/01-vector-add-cache:latest INFO[2025-05-27 11:36:15] baremetalFlag false INFO[2025-05-27 11:36:15] Adding the device to the registry [gpu][AMD] INFO[2025-05-27 11:36:15] Using AMD to obtain GPU info INFO[2025-05-27 11:36:15] Error registering rocm-smi: AMD already registered. Skipping ROCM INFO[2025-05-27 11:36:15] Initializing the Accelerator of type gpu INFO[2025-05-27 11:36:15] Starting up AMD INFO[2025-05-27 11:36:16] Using AMD to obtain GPU info INFO[2025-05-27 11:36:16] Startup gpu Accelerator successful INFO[2025-05-27 11:36:16] Trying local fetcher: *fetcher.dockerFetcher INFO[2025-05-27 11:36:16] Failed to fetch image locally using *fetcher.dockerFetcher: INFO[2025-05-27 11:36:16] Retrieve remote Img quay.io/mtahhan/01-vector-add-cache:latest!!!!!!!! INFO[2025-05-27 11:36:16] Img fetched successfully!!!!!!!! INFO[2025-05-27 11:36:17] Compatible cache found: 6088b9b2e5149e06670bcec5bb3df8c56758f20ec349dd5f77f1507fe7cdf5fa INFO[2025-05-27 11:36:17] Temporary directories successfully deleted.
Signing Container Images
Signing MCV container images can be easily done using Sigstore Cosign. Below is a summary of the steps needed to sign an image.
Step 1: Install Cosign
First, install the latest version of Cosign using the following command:
go install github.com/sigstore/cosign/v2/cmd/cosign@latest
Step 2: Sign the Image
To sign your container image, use the Cosign sign command. It is recommended to always use the image SHA rather than the latest tag. The latest tag can change over time, while the SHA uniquely identifies the specific image version, ensuring that you are signing the correct and immutable image. For example:
$ cosign sign -y quay.io/mtahhan/01-vector-add-cache@sha256:1fe2866013c0270d433c80e50c4aaa25920f9eb949e731816f1ca89279c21ca6 ⏎ Generating ephemeral keys... Retrieving signed certificate... The sigstore service, hosted by sigstore a Series of LF Projects, LLC, is provided pursuant to the Hosted Project Tools Terms of Use, available at https://lfprojects.org/policies/hosted-project-tools-terms-of-use/. Note that if your submission includes personal data associated with this signed artifact, it will be part of an immutable record. This may include the email address associated with the account with which you authenticate your contractual Agreement. This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later, and is subject to the Immutable Record notice at https://lfprojects.org/policies/hosted-project-tools-immutable-records/. By typing 'y', you attest that (1) you are not submitting the personal data of any other person; and (2) you understand and agree to the statement and the Agreement terms at the URLs listed above. Your browser will now be opened to: ...
During the signing process, Cosign will generate a URL and open it in your default browser. You will be prompted to authenticate using one of the supported providers:
- GitHub
- Microsoft
Once authenticated, Cosign will generate a verification code that you need to enter in the terminal to complete the signing process.
Step 3: Review the Signing Notice (or skip by passing the ‘-y’ flag on the commandline)
Cosign will display a legal notice explaining how your signing data will be stored:
- The signing service is hosted by Sigstore, part of the Linux Foundation (LF).
- Any data submitted, including your email address, will be stored in public transparency logs and cannot be removed.
- You must confirm that you are not submitting any personal data belonging to others.
To proceed, type y to agree to the terms
Step 4: Complete the Signing Process
After agreeing to the terms, Cosign will:
- Generate ephemeral keys.
- Retrieve a signed certificate.
- Create a transparency log (tlog) entry.
- Push the signature to your container registry.
Upon successful completion, you will see an output similar to:
Successfully verified SCT... tlog entry created with index: 215011903 Pushing signature to: quay.io/mtahhan/01-vector-add-cache
Note: One can also inspect the image, as shown below.
$ skopeo inspect containers-storage:quay.io/mtahhan/01-vector-add-cache:latest
{
"Name": "quay.io/mtahhan/01-vector-add-cache",
"Digest": "sha256:785f3d7fb1cc38a1c817bb3d72e4fecde75a367093a1133db7549d5150584cfb",
"RepoTags": [],
"Created": "2025-05-27T11:32:49.368494851Z",
"DockerVersion": "",
"Labels": {
"cache.triton.image/cache-size-bytes": "80415",
"cache.triton.image/entry-count": "1",
"cache.triton.image/summary": "{\"variant\":\"multi\",\"entry_count\":1,\"targets\":[{\"backend\":\"hip\",\"arch\":\"gfx90a\",\"warp_size\":64}]}",
"cache.triton.image/variant": "multi"
},
"Architecture": "amd64",
"Os": "linux",
"Layers": [
"sha256:370e94d944938ba1f0fbf509def1be948d8cc8665a56056dbc7ee94eacb287ca"
],
"LayersData": [
{
"MIMEType": "application/vnd.oci.image.layer.v1.tar",
"Digest": "sha256:370e94d944938ba1f0fbf509def1be948d8cc8665a56056dbc7ee94eacb287ca",
"Size": 93184,
"Annotations": null
}
],
"Env": null
}
Demo: Seeing Model Cache Vault in Action
To showcase the power of Model Cache Vault (MCV), we set up a simple yet representative demonstration.

Our Playground: We used a single-node Kubernetes cluster equipped with two AMD ROCm GPUs, all set up using kubeadm. This gave us a realistic environment to mimic many modern deployments.
We chose to deploy the Llama-3.1 model, specifically Llama-3.1-8B-Instruct. We tracked the model’s startup time by pulling detailed timing stats directly from the logs.
The Baseline: To get a benchmark, we first ran the model without any precaching of the vLLM cache. This meant the model kernels were compiled from scratch. The result? A startup time of a rather hefty 112 seconds.
Our Secret Weapon: The Init Container: To leverage MCV, we introduced an init container. This container was responsible for extracting and reusing our pre-compiled vLLM cache. Think of it as giving the model a head start by providing it with ready-to-go components.
The Results are In: And the difference was dramatic! With the pre-loaded cache, our startup time plummeted to just over 62 seconds. That’s almost exactly a 2x speedup! Imagine the impact in a production environment where services need to scale quickly and respond instantly.
Watch the full demo here:
This demo clearly illustrates the significant benefits of using Model Cache Vault to manage and reuse model caches. Not only does it save time, but it also ensures a more consistent and predictable startup experience.
When should you use MCV?
You’ll benefit from MCV if you:
- Deploy Triton kernels or vLLM models at scale and care about startup latency
- Use containerized or cloud-native infrastructure
- Want to avoid redundant compilation in CI/CD or runtime environments
- Operate in security-sensitive or regulated environments
- Need reproducibility and trust in your GPU execution stack
Summary
OCI image support for Triton/vLLM kernel caches, powered by MCV and secured with Sigstore Cosign, brings together performance, portability, and trust. It turns Triton/vLLM’s runtime optimization into a reusable asset, speeding up cold starts and reducing compute waste.
You don’t need to choose between speed and supply chain integrity, now you can have both.

