As the ecosystem and economy around AI continue to grow and the Internet of Things (IoT) becomes smarter and more prolific, a new paradigm of computing is emerging: edge AI, the application of AI technologies to advanced IoT systems. This has exciting applications in a wide range of fields, from wearable devices in medicine to computer-assisted driving in automotive.
However, developers currently face considerable friction when building edge AI applications. They often hit a wall when working with edge AI hardware, due in no small part to a fragmented ecosystem of software tools and drivers. Even just getting a device ready for an application (a process called “enablement”) can consume weeks or even months of engineering time as teams work to deliver compatibility all the way up the software stack. Compounding this challenge, these efforts are frequently siloed, with disparate groups in an organization independently rolling their own solutions. All of this points to a demand for a standardized, open method for edge AI hardware enablement.
Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.

To address this, Red Hat’s Office of the CTO has created the Edge AI Image Pipelines project: a set of GitHub Actions pipelines for customizing and building Red Hat Enterprise Linux (RHEL) bootc images that support popular AI frameworks on edge AI hardware using verified patterns. If all you want to do is build and/or run your own edge AI images using our project, we recommend checking out the project README on GitHub for links and instructions. If you would like more technical detail on the motivation for this project and the design decisions behind it, that is what this blog is for. First, we will go over the hardware enablement problem: what it is and why it matters so much for edge AI. Then we will cover our patterns for building RHEL images for edge AI, wrapping up with the AI and orchestration applications we support in these images.
The current implementation supports only NVIDIA Jetson Orin Developer Kit boards since those are readily available to our team. Support for additional hardware platforms is planned.
The hardware enablement problem
Before any edge AI application can exist, we need edge hardware to host it. The computational demands of AI are steep: the CPUs typically found in IoT devices do not provide enough performance for AI inference workloads. At the same time, the datacenter GPUs where AI applications are typically hosted far exceed the power and price budget of almost any edge environment, so edge AI becomes its own niche for hardware acceleration.
Recognizing this, many companies have developed systems on a chip (SoCs) specifically for edge AI, generally characterized by embedded CPUs accompanied by low-power, low-precision but high-throughput accelerators, providing enough AI performance to host even some small LLMs at human reading speed.
However, the software ecosystem supporting these chips has yet to reach full maturity. The accelerators on these chips often use newer, less well-supported architectures than their datacenter counterparts. For example, the NVIDIA Jetson Orin (a prominent chip in edge AI today) uses ARM CPUs, a special “SM_87” GPU architecture, and a number of specialized processors such as the DLA and PVA. Fundamental hardware details such as these cannot be entirely abstracted away from the development process. This creates compatibility (i.e., enablement) problems when deploying software stacks that weren’t designed with these specific edge platforms in mind.
This is what we try to address with the verified patterns in Edge AI Image Pipelines: starting with common challenges when working on these platforms and providing configurations or patches to overcome them. By doing so, we intend to minimize the development hours dedicated to resolving issues like driver compatibility and software target architectures that arise from deploying AI software stacks on untested hardware.
The base image
We know the importance of hardware enablement, so where do we start? Well, we can begin by creating a base image: one with all the basic system packages and drivers needed to run our class of applications. In this project, we do this using bootc, a system for building bootable images using container technology.
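As an illustrative sketch of what that looks like in practice (the base image tag and package names here are assumptions, not the project’s actual build definition), a bootc image is defined with an ordinary Containerfile built on a bootable base image:

```dockerfile
# Illustrative Containerfile for a Jetson-oriented bootc base image.
# The base tag and package list are placeholders, not the project's
# real build definition.
FROM registry.redhat.io/rhel9/rhel-bootc:9.4

# Layer in the platform packages our class of applications needs,
# e.g. container tooling for the runtimes we'll add later.
RUN dnf install -y podman && dnf clean all
```

A `podman build` of a Containerfile like this yields an OCI image that can be booted directly on supported hardware or converted into a flashable disk image.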
The base image contains two software components critical for AI workflows: RHEL Jetson-compatible CUDA drivers, and the NVIDIA Container Device Interface (CDI) toolkit. The CUDA drivers allow AI applications running on RHEL systems to communicate with the NVIDIA GPUs for acceleration. The NVIDIA CDI, as discussed in this post, allows containers running on the system to access GPUs, which we will use later as we containerize our RHEL-Jetson applications.
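On a running system, CDI is what wires the two components together. A minimal sketch of how that looks from the shell (the exact device names injected depend on the driver installation):

```shell
# Generate a CDI specification describing the system's NVIDIA devices.
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the CDI device names now available to container engines.
nvidia-ctk cdi list

# Hand the GPU to a container by CDI name; listing the injected
# device nodes is a quick smoke test that the plumbing works.
podman run --rm --device nvidia.com/gpu=all ubi9 ls /dev/nvidia*
```

This is the same mechanism our application containers use later; nothing Jetson-specific is required beyond the drivers in the base image.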
The base image is created using a Konflux pipeline. Konflux is a Red Hat CI/CD system that uses Tekton to build container images with a stronger security posture, and is used in this project whenever we need a little more power or security than what the free GitHub Actions runners provide. This build process is unfortunately not open source since it redistributes certain proprietary NVIDIA binaries, but the final image is accessible from this quay repository.
Let’s check it out. In this example, I’ve flashed an NVIDIA Jetson Orin NX 16GB with `quay.io/redhat-user-workloads/octo-edge-tenant/rhel-bootc-jetson-rhel94:latest`.
Just to show basic functionality, let’s run a matrix multiply CUDA kernel and compare its performance to a CPU execution (sample code here):
```
root@9a3e2d7c7b6d:/workspace/mmm# time ./mmm_cpu 2000 | head -1   # 2000x2000 on CPU
real    1m3.438s
root@9a3e2d7c7b6d:/workspace/mmm# time ./mmm_gpu 2000 | head -1   # 2000x2000 on GPU
real    0m7.517s
```
Nice!
Applications
Now that we have a basic system booting and capable of hardware acceleration, it’s time to enable some runtimes. These serve to abstract the edge AI hardware away from the core AI logic, allowing developers to use the same processes at the edge as they would in the cloud, worrying only about resource parameters such as compute and power budget.
We package each runtime as its own container. This simplifies the build pipelines and eases integration with cloud technologies (e.g., MicroShift). These containerized applications use a physically bound model: they are packaged into the bootc image itself (as opposed to being “logically bound” and pulled at runtime). This lets these systems operate in the network-constrained environments characteristic of edge computing, where the system must be robust to intermittent or even completely absent network connectivity.
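One way this physical binding can be sketched in a Containerfile (the image names and paths here are illustrative of the documented “embed in an additional image store” pattern, not the project’s exact build):

```dockerfile
FROM quay.io/example/rhel-bootc-jetson:latest

# Pull the runtime container into a read-only image store that lives
# inside the bootc image itself, so no registry access is needed at
# boot time.
RUN podman pull --root=/usr/lib/containers/storage \
    docker://quay.io/example/ollama-jetson:latest

# Point the deployed system's podman at the embedded store
# (additionalimagestores belongs under [storage.options]).
COPY storage.conf /etc/containers/storage.conf
```

With the store baked in, the runtime container starts at boot even on a device with no network connectivity at all.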
As a start, we enable three popular inference runtimes: vLLM, Triton Inference Server, and Ollama, as well as MicroShift for orchestration. For each runtime, we’ve created a GitHub workflow to build and customize images containing it. Additionally, we’ve made prebuilt disk images for each runtime and uploaded them as artifacts to the project’s Quay repository for quick experimentation.
While Ollama is straightforward to support, vLLM and Triton Inference Server each have intricacies that we go into detail on below.
Enabling PyTorch and vLLM
vLLM, powered by PyTorch, is Red Hat’s solution for high-performance LLM inference. Since LLMs are the centerpiece of the current excitement around AI, it is critical that we support this platform on NVIDIA Jetson. However, this application also ends up being the most difficult to enable, because of a peculiarity of the Jetson Orin: all CUDA kernels must be compiled specifically for SM_87, the board’s special GPU architecture. Other NVIDIA GPUs offer standard minor-version compatibility: kernels compiled for an SM_XY GPU can run on an SM_XZ GPU as long as Z >= Y. The Jetson family does not offer this, and since no other GPU family shares the Jetson Orin’s SM_87 architecture, any program that has not already been compiled specifically for it is a program we must compile ourselves.
So, what does support for SM_87 look like? Both vLLM and PyTorch ship their own CUDA kernels in their wheels (Python packages). vLLM compiles them for the SM_87 architecture, but PyTorch does not, meaning we must build PyTorch ourselves. Luckily, PyTorch provides public build containers for this exact purpose. Unluckily, PyTorch is a large project whose build takes many hours and exceeds the resource budget of modestly sized systems such as the GitHub runners we normally work with, so we buy ourselves some extra power by leaving this to another Konflux pipeline, the code for which can be found in this GitLab repository.
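For readers reproducing that build, the key knob is PyTorch’s `TORCH_CUDA_ARCH_LIST` environment variable, which selects the SM architectures its kernels are compiled for. A minimal sketch, run inside one of PyTorch’s build containers (the exact steps live in the GitLab repository):

```shell
# Compile PyTorch's CUDA kernels only for the Jetson Orin's SM_87.
export TORCH_CUDA_ARCH_LIST="8.7"
export USE_CUDA=1

# Produce a redistributable wheel from a PyTorch source checkout.
python setup.py bdist_wheel
```

Restricting the arch list to `8.7` also keeps the already long build from multiplying its compile time across architectures the Jetson can never use.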
Once this is built, we can install vLLM (almost) like normal in a container that is ready to be pulled and run on Jetson! There are instructions on how to use and customize this image in the project README. Additionally, we’ve made a video showing how to use the prebuilt vLLM image:
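Once the container is running on a Jetson, using it looks like any other vLLM deployment; a sketch (the model name and port here are illustrative, not project defaults):

```shell
# Start vLLM's OpenAI-compatible server inside the container.
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 &

# Query it like any OpenAI-style endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct",
       "prompt": "Hello", "max_tokens": 16}'
```

Because the interface is the standard OpenAI-style API, applications written against cloud-hosted vLLM need no changes to target the Jetson.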
Enabling Triton Inference Server
Triton Inference Server (not to be confused with the unrelated triton-lang Python DSL), powered by NVIDIA TensorRT, is NVIDIA’s cloud AI inference solution for NVIDIA hardware platforms. Critically, the TensorRT runtime allows us to use the NVIDIA Deep Learning Accelerator (DLA), a purpose-built co-processor on the Jetson devices for accelerating AI applications, in addition to the GPU. Because NVIDIA provides Triton containers for Jetsons already, there’s little we need to do to enable the actual software. The challenge instead comes from model compilation: the TensorRT runtime requires a special “engine” model format that must be compiled natively, meaning we cannot use the generic GitHub runners to generate these engines.
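For context on where those engines end up, Triton serves models out of a model repository with a fixed layout; a minimal sketch (the model name is illustrative; `tensorrt_plan` is Triton’s backend identifier for TensorRT engines):

```text
model_repository/
└── resnet50/
    ├── 1/
    │   └── model.plan   <- the natively compiled TensorRT engine
    └── config.pbtxt     <- at minimum: name, platform: "tensorrt_plan"
```

The engine file under the numbered version directory is exactly the artifact that must be compiled natively on the target Jetson.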
Instead, we need to integrate an NVIDIA Jetson into our CI. In Edge AI Image Pipelines, we do this by defining a special workflow that uses Jumpstarter, an open source hardware-in-the-loop (HIL) framework allowing for remote access to embedded systems and their peripherals. Unfortunately, setting up a Jumpstarter board is somewhat involved and outside the scope of this blog, so if you’d prefer to compile TensorRT engines manually, there is an option to do that instead.
Fortunately, once Jumpstarter is set up, the process is not much more complicated than any other build process: just transfer the .onnx model files to the device and pass them to the TensorRT CLI binary. The GitHub runner will then consume the compiled models and produce a ready Triton Inference Server image.
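Whether it runs through the Jumpstarter workflow or by hand, that on-device step boils down to NVIDIA’s `trtexec` tool (the file paths here are illustrative):

```shell
# Compile an ONNX model into a TensorRT engine natively on the Jetson,
# targeting DLA core 0 and falling back to the GPU for any layers the
# DLA does not support.
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --useDLACore=0 \
        --allowGPUFallback
```

The resulting `.plan` engine is specific to the device’s TensorRT version and hardware, which is precisely why it cannot be produced on a generic GitHub runner.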
Like with vLLM, the README has more thorough instructions. We also have a demo of running the prebuilt Triton Inference Server image on MicroShift:
Building applications
We use GitHub Actions as our CI/CD platform for building bootc images and for compiling them to flashable disk images. Because of this, building your own RHEL Jetson images is as simple as forking the repository and running the associated `workflow_dispatch`. If all you really want to do is try out one of the applications we’ve supported (vLLM, Ollama, Triton Inference Server), it’s even easier than that. We have flashable images available in this quay repository and instructions on how to use them in the GitHub README.
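For example, a build on your fork can be kicked off from the `gh` CLI (the workflow file name and input shown here are hypothetical; see the repository for the real ones):

```shell
# Trigger a workflow_dispatch run on your fork of the project.
gh workflow run build-vllm-image.yml \
  --repo <your-user>/edge-ai-image-pipelines \
  -f model=ibm-granite/granite-3.1-2b-instruct
```

The same run can also be started from the Actions tab in the GitHub web UI.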
Customizing and building your own applications will be more complicated, but since we use GitHub Actions and bootc images as our framework, you’ll get all the flexibility, reusability, and extensibility that normally comes with those technologies. Each pipeline to build one of the AI runtimes has options to add your own models, and the RHEL bootc image builder action has options to add additional application containers for additional logic.
Wrap up and future work
Every new computing paradigm, from the cloud to containers to DevSecOps, eventually develops a set of standard, verified patterns that make developers’ lives easier. Edge AI is no different. The Edge AI Image Pipelines project explores what some of these patterns should look like for edge AI, and produces an easy way to get up and running with Red Hat’s AI stack on NVIDIA Jetson.
A few key takeaways from this project:
- The enablement challenge: Developers currently face significant friction due to the fragmented software ecosystem surrounding edge AI hardware.
- A standardized solution: The Edge AI Image Pipelines project offers verified patterns via GitHub Actions for building Red Hat Enterprise Linux bootc images for AI runtimes and edge AI platforms.
- Open and accessible: Developers can utilize prebuilt images for quick experimentation or fork the repository to create their own custom edge AI applications.
This project, like all of Red Hat’s products, follows an open source development model, so as you run into issues, or even develop patches of your own, please open an issue or a PR on GitHub.
As mentioned earlier in the blog, NVIDIA Jetson enablement is only the beginning of this project. AI is a rapidly evolving field, edge AI especially so, and as more hardware and software platforms gain traction in edge, we intend to build this project out to accommodate them as well, so stay tuned!

