Protecting Triton kernel deployments with cryptographic signatures

Feb 5, 2026 | AI

Triton is a domain-specific language and compiler for writing high-performance GPU kernels (snippets of compiled GPU code) using a Python-like syntax. It offers fine-grained control over memory and parallelism, making it ideal for custom, architecture-optimized compute in machine learning and high-performance computing workloads.

The simplicity of the Python-like syntax combined with state-of-the-art GPU-level performance is achieved through a combination of just-in-time (JIT) compilation for the GPU backend and caching of the compiled GPU code. Each Triton “kernel” corresponds to a particular function in the Python-like source. If a particular Triton program is run again, the compiler will attempt to load code from the cache before resorting to JIT compilation. The lookup is based on a simple source-signature and argument match, without any additional checks. While the potential for outright exploitation is unclear, this lack of verification offers ample possibilities for denial of service, corruption and other runtime problems.
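Conceptually, the unverified lookup looks something like the following sketch. This is a hypothetical stand-in, not Triton's actual implementation: the cache key is derived from the kernel source and its argument signature, and whatever blob sits under that key is returned without any integrity check.

```python
import hashlib

# Hypothetical stand-in for a JIT cache: keyed lookup, no verification.
cache = {}

def cache_key(source: str, arg_types: tuple) -> str:
    # Key derived from the kernel source and argument signature only.
    return hashlib.sha256((source + repr(arg_types)).encode()).hexdigest()

def load_or_compile(source: str, arg_types: tuple) -> bytes:
    key = cache_key(source, arg_types)
    if key in cache:
        # Cache hit: the blob is returned as-is -- nothing checks
        # whether it still matches what was originally compiled.
        return cache[key]
    binary = compile_kernel(source)  # JIT compile on a miss
    cache[key] = binary
    return binary

def compile_kernel(source: str) -> bytes:
    # Placeholder for the real JIT backend.
    return b"<compiled GPU binary for: " + source.encode() + b">"
```

Anything that overwrites the cached blob between runs — accidental corruption or deliberate tampering — is silently loaded on the next hit.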

Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.

This problem is not new – it is common whenever you have to load executable code at runtime. Can you trust it? Can you be sure that this piece of code was not modified in the meantime? How do you know that it is coming from a trusted source (even if that source is yourself)? How do you maintain integrity across a scale-out and/or a large-scale production deployment? To deal with all of these problems, we propose the introduction of cryptographic signature support in the Triton kernel loader.

This is the second installment in a three-week deep dive into Triton kernel trust and performance. Last week, we covered OCI caching. In this post, we’ll cover the motivation, mechanics, and integration of adding cryptographic signing into your workflow.

Enhancing the Triton JIT cache

Triton is designed and maintained as a relatively closed system. It is intended to be used “as is” to process Python code with Triton language extensions, rather than having parts of it embedded as a library in another project. The user supplies Python code with the Triton language extensions. Behind the scenes, Triton JIT-compiles binary kernels for the accelerator hardware and passes user data to them for processing, with the result returned via the Python code. Compilation, execution and caching of the kernels are controlled through environment configuration “knobs”, with minimal possibilities to enhance the backend functionality via user-provided code.

There is, however, one exception to this – the search for and loading of “kernel file lists” (called groups) from the Triton JIT cache. This part is user-configurable and may use external Python modules implementing the Triton CacheManager interface. Using a user-supplied module instead of the Triton stock configuration allows us to alter the behaviour and apply additional policies, such as:

  • What should be JIT-ed and what must be supplied from the cache
  • What should be picked up from the default Triton JIT cache and what should be supplied from a different source
  • Most importantly, using an alternative CacheManager allows adding verification steps

The downside of this approach is that it is of limited utility when Triton is used as a component in other projects such as PyTorch, vLLM, etc. As noted before, such “library style” use is contrary to Triton’s design philosophy. These projects achieve it by rewriting core functionality and/or “borrowing” compiled artifacts from Triton so they can use them through their own loaders. As a result, CacheManager overrides can act only once – when Triton is invoked for the first time. Any future access to the compiled artifacts will bypass them and use the upper-layer (vLLM, PyTorch, etc.) cache.

Implementing the CacheManager interface

First of all, the CacheManager interface does not perform the actual kernel load – it only provides the locations of the cached files and tests for their existence. The loading itself is performed by the binary accelerator driver stubs generated by Triton, which take the file names supplied by the CacheManager as arguments. This makes the interface relatively simple and easy to implement, and several projects such as triton-dejavu have done so. Unfortunately, all of them suffer from the same problem – their CacheManager extensions are project-specific and mutually incompatible.
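To make the shape of the interface concrete, here is a minimal file-based manager in the spirit of Triton's CacheManager. The method names and signatures are simplified for illustration; the real interface in Triton's runtime differs in detail. Note that get_file only reports a path – it never loads the binary itself.

```python
import os

class SimpleFileCacheManager:
    """Simplified sketch of a file-based cache manager; the real
    Triton CacheManager interface differs in detail."""

    def __init__(self, key: str, cache_dir: str):
        # One directory per kernel hash, as in ~/.triton/cache/<KEY>/
        self.cache_dir = os.path.join(cache_dir, key)
        os.makedirs(self.cache_dir, exist_ok=True)

    def get_file(self, filename: str):
        # Only reports the location; the driver stub does the actual load.
        path = os.path.join(self.cache_dir, filename)
        return path if os.path.exists(path) else None

    def put(self, data: bytes, filename: str) -> str:
        # Store a compiled artifact under this kernel's cache directory.
        path = os.path.join(self.cache_dir, filename)
        with open(path, "wb") as f:
            f.write(data)
        return path
```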

Using the CacheManager interface to extend Triton functionality

Let’s take a step back and do this by the book. How do you design a cache? What are the essential features?

The first thing which comes to mind is hierarchy and fallthrough. If we were to design this from scratch, an essential requirement would be to support a list of caches – where do we look first, where do we look next, and what do we do if all lookups fail. Triton has no built-in support for such chaining of CacheManagers, so step one is to implement it by creating a HierarchicalCacheManager – an umbrella manager which handles the chaining and, potentially, the fallthrough to Triton, triton-dejavu or other third-party plugins, all of which can implement the CacheManager interface.
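The chaining logic itself is straightforward. The sketch below assumes each child manager exposes a get_file lookup that returns a path or None; the class and field names are illustrative, not the actual TCE code:

```python
class DictCacheManager:
    """Trivial in-memory child cache used to demonstrate chaining."""
    def __init__(self, entries: dict):
        self.entries = entries
    def get_file(self, filename: str):
        return self.entries.get(filename)

class HierarchicalCacheManager:
    """Umbrella manager: tries each child cache in order and
    falls through to the next on a miss (sketch, not the TCE code)."""

    def __init__(self, managers, allow_jit_fallback: bool = True):
        self.managers = managers
        self.allow_jit_fallback = allow_jit_fallback

    def get_file(self, filename: str):
        for mgr in self.managers:
            path = mgr.get_file(filename)
            if path is not None:
                return path  # first hit wins
        if self.allow_jit_fallback:
            return None  # signal the caller to JIT-compile
        raise FileNotFoundError(f"{filename} not found in any cache")
```

With allow_jit_fallback disabled, a miss in every cache becomes a hard error – a useful policy for locked-down production deployments.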

Once we have this in place, we can implement the desired functionality. This could mean looking for kernels in multiple locations, where we load multiple CacheManager instances responsible for different cache directories. An example is a prepackaged read-only global cache combined with a fallthrough cache specific to this Triton process. We could also use this approach to implement different cache properties, such as a read-only cache, a read-write cache, and more.

But most importantly, we can now protect some of the caches in the list with signatures even if we keep JIT as a last resort fallthrough.

Implementing signatures

As noted before, we cannot override the actual kernel load. Thus, we have to compute all checksums and check the signatures while processing the file lists (in Triton parlance: “file groups”). If the signature does not match, we raise an exception, which in most cases aborts the run. If the signature matches, we return the file list and pass it to the actual driver loaders to load and execute the kernel.
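The check can be sketched as follows. To keep the example self-contained, HMAC-SHA256 stands in for the RSA signatures TCE actually uses, and the sign_group/verify_group functions and manifest layout are illustrative, not the real TCE format:

```python
import hashlib, hmac, json

def sign_group(files: dict, key: bytes) -> str:
    """Sign a file group: files maps file name -> content bytes.
    (HMAC stand-in for the RSA signing TCE actually uses.)"""
    # Combine per-file SHA-256 checksums into one canonical digest.
    manifest = json.dumps(
        {name: hashlib.sha256(data).hexdigest()
         for name, data in sorted(files.items())}
    ).encode()
    return hmac.new(key, manifest, hashlib.sha256).hexdigest()

def verify_group(files: dict, key: bytes, signature: str) -> dict:
    if not hmac.compare_digest(sign_group(files, key), signature):
        # Mirrors the behaviour described above: a mismatch aborts the run.
        raise ValueError("Invalid signature")
    return files  # safe to hand to the driver loaders
```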

While there are multiple options for implementing signature checking, we have decided to enhance the pathlib module, which is used by most of the Triton code base to load and store files. The signature- and compression-aware pathlib module has already been contributed upstream and is available from PyPI.

Caveat: we cannot implement signing on the fly while JIT-ing.

The reason is that all kernel and intermediate file paths are stored in the cache group file upon generation. Moving them to a different location requires rewriting the group file, which invalidates the signature.

Thus, the workflow is: 

  • Generate kernels by running at small scale and JIT-ing, and/or using ahead-of-time compilation
  • Graduate/QA the kernel cache for production
  • Copy the QA-ed kernel cache to a different location and sign it
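In outline, the final copy-and-sign step does something like the sketch below. Again HMAC-SHA256 stands in for RSA, the group.sig manifest name is illustrative, and rewriting the paths inside the group JSON file is omitted for brevity:

```python
import hashlib, hmac, os, shutil

def copy_and_sign(src_dir: str, dst_dir: str, key: bytes) -> str:
    """Copy one cache entry to a new location and sign it
    (illustrative sketch of the copy-and-sign workflow)."""
    os.makedirs(dst_dir, exist_ok=True)
    digests = []
    for name in sorted(os.listdir(src_dir)):
        # Copy the artifact, then record its checksum at the destination.
        shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
        with open(os.path.join(dst_dir, name), "rb") as f:
            digests.append(name + ":" + hashlib.sha256(f.read()).hexdigest())
    # Sign the combined digest of all copied files.
    sig = hmac.new(key, "\n".join(digests).encode(), hashlib.sha256).hexdigest()
    with open(os.path.join(dst_dir, "group.sig"), "w") as f:
        f.write(sig)
    return sig
```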


How do Triton Cache Extensions (TCE) work?

Creating a signed Triton cache

Below is an example of using TCE tools to copy a set of cache entries to a different location and sign them using an RSA private key:

Step 1 – Run the Triton code with a clean cache

We will use one of Triton’s stock examples for this purpose:

$ rm -rf ~/.triton/cache/*

$ python3 ./01-vector-add.py > /dev/null

$ ls ~/.triton/cache/
AHYQKAJ6TB6BK53CJ7WW6DIUQDWAJVMJ3A7Z2OW2XXVQBB7HIPFA
HN4JZNUV4NXB5MI55YHVXGWQOUMJKL7IG4GHTS673MBNEEYCZP2Q
XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ

Some of the directories are of no interest to us – they contain driver stubs and other binaries which are generated on the fly, but not loaded via the normal kernel load mechanism. 

The ones which are interesting look like this (for an AMD system): 

$ ls ~/.triton/cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ
add_kernel.amdgcn  add_kernel.json  add_kernel.ttgir  __grp__add_kernel.json
add_kernel.hsaco   add_kernel.llir  add_kernel.ttir

Step 2 – Create a signed copy of the kernel files in our “signed” cache

Copy all files, update the group file to point to the new location, and sign all files with the key from testpriv.pem:

$ python3 copy_entry.py ~/.triton/cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ  ~/test-cache --sign-with testpriv.pem

Step 3 – Enable the CacheManager plugin and test the signed cache

Create a hierarchical cache config. This example has only one plugin/cache in the hierarchy – a signed cache located in ~/test-cache, with the public key located in public-key.pem.

Fallthrough to JIT is allowed: if an entry is not found in the signed cache, Triton will compile the kernel and store it in the normal cache at ~/.triton/cache/.

{
    "cache_managers": [
        {
            "id": "test_manager",
            "cache_dir": "/home/fedora/test-cache",
            "rsa_key": "/home/fedora/public-key.pem"
        }
    ],
    "fallback": true,
    "debug": true
}
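When the plugin starts, a config like this could be parsed along these lines. This is a sketch of how such a loader might validate the fields; the actual TCE loader's field handling may differ:

```python
import json

def load_tce_config(path: str) -> dict:
    """Parse a TCE-style config file (sketch; the real loader may differ)."""
    with open(path) as f:
        cfg = json.load(f)
    for mgr in cfg.get("cache_managers", []):
        # Each entry needs at least an id and a cache directory.
        for field in ("id", "cache_dir"):
            if field not in mgr:
                raise ValueError(f"cache manager entry missing {field!r}")
    cfg.setdefault("fallback", True)  # allow JIT fallthrough by default
    return cfg
```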

Enable the cache manager plugin via the environment variable, remove the original Triton cache entry and re-run our test code:

$ export TRITON_CACHE_MANAGER="cache:HierarchicalCacheManager"

$ export TCE_CONFIG=`pwd`/config.json

$ rm -rf ~/.triton/cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ

$ python3 ./01-vector-add.py

$ ls -laF ~/.triton/cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ
ls: cannot access '/home/fedora/.triton/cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWR': No such file or directory

Triton used the copy in our test-cache directory and checked its signature. JIT was not invoked, and Triton did not try to recompile the kernel.

We can now demonstrate the effect of the signatures by intentionally corrupting the binary.

$ dd if=/dev/random bs=4096 count=1 of=test-cache/XVS4CMNO6DDCOWELDUDXGMATPZBBDHLFOH5OJGCA7XO7RK52VWRQ/add_kernel.amdgcn
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 9.5519e-05 s, 42.9 MB/s

$ python3 ./01-vector-add.py

Instead of the expected result, we get a long backtrace which starts with:

raise ValueError("Invalid signature")

When should you use TCE?

You’ll benefit from TCE if you:

  • Deploy Triton kernels at scale and care about startup latency
  • Want to confirm that kernels remain unchanged during a long computation run with multiple node spin-up/spin-down events
  • Want to avoid redundant compilation in CI/CD or runtime environments
  • Operate in security-sensitive or regulated environments
  • Need reproducibility and trust in your GPU execution stack


Summary

Protecting Triton’s cache with cryptographic signatures can be quite useful in long-lived computations. It allows users to “mark” kernels as production-ready and prevents unintentional corruption as well as some denial-of-service attacks. It can also supplement other means of achieving integrity, such as read-only container images and cache overlays.