In a prior blog post, we provided an overview of the Triton language and its ecosystem. Triton is a Python-based DSL (Domain Specific Language), compiler, and related tooling designed for writing efficient GPU kernels in a hardware-agnostic manner, offering high-level abstractions while enabling low-level performance optimization for AI and HPC workloads.
In this post, we describe some guidelines for getting started with developing and using Triton kernels, either in combination with or independent of frameworks such as PyTorch, and for doing so using Red Hat Universal Base Image (Red Hat UBI)/Fedora-based containers while leveraging the AMD Instinct family of AI accelerators, such as the MI200 and MI300 families of devices.¹
Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.
Step 0: Hardware setup
In this blog, we assume you have already deployed a Linux server (such as a physical machine or VM running Fedora 41) with AMD Instinct GPUs and with AMD GPU device drivers installed. For the examples below, we used a server running Red Hat Enterprise Linux 9.4 equipped with AMD MI300X GPUs. Let's proceed from there.
Step 1: Build and run the development container
The artifacts and examples used in this article are available in this repo. The following steps assume you have cloned this repo onto your GPU-equipped server and are working from the folder that contains the Containerfile. From a quick review of the sample Containerfile, we see that it uses a Red Hat UBI image as the base, adds a slightly older Python version as required by the PyTorch version, adds RPMs that package the AMD ROCm driver libraries, and finally adds a relatively recent version of PyTorch with ROCm support. Installing PyTorch 2.x (ideally 2.3.0 or newer) also installs Triton as a dependency, although Triton can be installed independently if PyTorch is not being used. For more information on the benefits of Red Hat UBI images, refer to this article.
Build a container image from this Containerfile using tooling such as podman:
podman build -t torch-triton-amd-ubi -f ./Containerfile.ubi
This should build a local container image successfully. Now run it as shown below:
sudo podman run --rm -it --device=/dev/kfd --device=/dev/dri --group-add=video --security-opt=label=disable --cap-add=SYS_PTRACE --env HIP_VISIBLE_DEVICES=0 torch-triton-amd-ubi:latest
Note that in this example, we run the container with some elevated capabilities in order to get better visibility and control for the learning and troubleshooting exercises. In production deployments, elevated capabilities should not be used; a tighter set of access privileges is recommended for better security. In future blog posts, we will cover some of those operational and security guidelines. Multiple GPUs can be exposed to the container by supplying a comma-separated list of IDs in the HIP_VISIBLE_DEVICES environment variable (for example, HIP_VISIBLE_DEVICES=0,1). For starters, we are exposing just a single GPU (GPU 0).
Step 2: Check GPU visibility with ROCm utilities
Once the container is running successfully, you should be in a bash shell inside the container. Here you can use ROCm utilities to check for the available GPUs and other ROCm information. amd-smi and rocminfo are two such tools bundled into this particular container that you can try out.
Here are the commands that you will use (for ease of copy / paste):
amd-smi version
and
amd-smi static -g 0 | grep MARKET_NAME
Here is what the output should look like:
[root@1d47d4c3fcdd workspace]# amd-smi version
AMDSMI Tool: 24.6.2+2b02a07 | AMDSMI Library version: 24.6.2.0 | ROCm version: 6.2.0
[root@1d47d4c3fcdd workspace]# amd-smi static -g 0 | grep MARKET_NAME
    MARKET_NAME: MI300X-O
Step 3: Basic PyTorch and Triton bring-up, verification and GPU detection
After confirming that the devices are visible, you can move on to exercising PyTorch and Triton. Examine and execute the following program (examples/torch-triton-gpu-checks.py) provided in the repo to verify that the PyTorch and Triton packages work in this container and are able to detect the AMD GPUs, and to get familiar with some device-related PyTorch and Triton APIs.
Here is the command that you will use (for ease of copy / paste):
python ./examples/torch-triton-gpu-checks.py
And here is what the output should look like:
[root@1d47d4c3fcdd workspace]# python ./examples/torch-triton-gpu-checks.py
PyTorch version : 2.5.0+rocm6.2
Triton version : 3.1.0
PyTorch detected GPU ? : True
PyTorch detected GPU name : AMD Instinct MI300X VF
Triton backend : hip
PyTorch detected hip : 6.2.41133-dd7f95766
Test CPU tensor is : tensor([1., 2., 3., 4.]) is_cuda: False
Test GPU tensor is : tensor([1., 2., 3., 4.], device='cuda:0') is_cuda: True
From this output (and a quick check of the program's source code) you can confirm the versions of the various components and that both PyTorch and Triton correctly detect the AMD GPUs. It also confirms that you can create tensors both in CPU memory and in GPU memory, which are baseline requirements for the setup. A key point to note is that AMD ROCm drivers are designed to work seamlessly with PyTorch's CUDA-related APIs (i.e., all CUDA tensor functionality in the torch.cuda package, from torch.cuda.get_device_capability() to torch.Tensor.is_cuda and so on). You can check the output and source code of this sample program (for instance here and here) to see that we use the "cuda"-qualified PyTorch methods and operators, and "cuda" as the device type for tensors resident in GPU memory, even when dealing with AMD GPUs. This is intentional on the part of AMD (and other GPU vendors) to enable quick and seamless reuse of code originally written for GPUs from other vendors. Refer to this article for more information on the reuse and applicability of CUDA semantics with AMD's ROCm and HIP software. At the same time, there are also methods available to get information that is specific to ROCm or HIP (see, for example, the use of torch.version.hip in our sample program). At the time of writing, some aspects of device-naming terminology are still evolving across these projects, so it is recommended that you check the latest device-naming requirements for the particular PyTorch and/or Triton methods you use.
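For orientation, here is a minimal sketch of the kinds of checks that program performs. It is not the exact contents of examples/torch-triton-gpu-checks.py, and the Triton backend lookup in particular is an assumption that may vary across Triton versions:

import torch
import triton

print("PyTorch version :", torch.__version__)
print("Triton version  :", triton.__version__)
print("GPU available   :", torch.cuda.is_available())
print("GPU name        :", torch.cuda.get_device_name(0))
print("HIP runtime     :", torch.version.hip)  # populated on ROCm builds of PyTorch

# The Triton backend should report "hip" on ROCm; this attribute path may
# differ between Triton releases, hence the defensive lookup.
try:
    from triton.runtime import driver
    print("Triton backend  :", driver.active.get_current_target().backend)
except Exception:
    print("Triton backend  : unknown (API differs across Triton versions)")

# Tensors created with device="cuda" are placed in AMD GPU memory under ROCm.
t_cpu = torch.tensor([1.0, 2.0, 3.0, 4.0])
t_gpu = t_cpu.to("cuda")
print(t_cpu, "is_cuda:", t_cpu.is_cuda)
print(t_gpu, "is_cuda:", t_gpu.is_cuda)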
Step 4: Examine and run a hand-written Triton kernel
Next, let us try running a hand-written Triton kernel directly. For this case, we can use the following example program provided in the repo, examples/triton-vector-add.py. This example shows the design of a hand-coded Triton kernel for performing GPU-based additions of large vectors. The program is mostly a clone of a similar program from the upstream Triton tutorials and illustrates key concepts such as the Triton programming model, the use of the @triton.jit decorator, program_id and other Triton language primitives, and some basics of block-centric programming, as well as how to run some benchmarking tests. If the program works correctly, you should see output similar to that shown below, and it should indicate that there was no difference in the result tensor when calculated via PyTorch on the CPU versus using the hand-written Triton kernel to offload the compute portion of the application to the GPU device. We will cover benchmarking and performance profiling in more detail in future blogs.
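For reference, the core of such a kernel looks roughly like the condensed sketch below, adapted from the upstream vector-add tutorial; the file in the repo may differ in details such as the block size and the benchmarking harness:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each Triton "program" (work unit) handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # 1D launch grid: enough programs to cover all elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(torch.max(torch.abs(add(x, y) - (x + y))))  # expect 0.0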
Here is the command you will use:
python ./examples/triton-vector-add.py
And here’s what the output should look like:
[root@25017227e553 workspace]# python ./examples/triton-vector-add.py
Large vector addition example using torch, triton
tensor([1.3713, 1.3076, 0.4940, ..., 0.6724, 1.2141, 0.9733], device='cuda:0')
tensor([1.3713, 1.3076, 0.4940, ..., 0.6724, 1.2141, 0.9733], device='cuda:0')
The maximum difference between torch and triton is 0.0
vector-add-performance:
       size     Triton      Torch
0    4096.0   5.424866   5.894232
1    8192.0  11.676447  11.788464
2   16384.0  22.914685  23.130352
<snip>
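For context, the throughput numbers in that table are produced by Triton's built-in benchmarking helpers (the example uses triton.testing.perf_report to sweep vector sizes). A minimal, standalone sketch of timing a vector addition with triton.testing.do_bench, using plain torch addition as the measured callable, might look like this:

import torch
import triton.testing

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")

# do_bench runs the callable repeatedly on the GPU and returns a runtime in milliseconds.
ms = triton.testing.do_bench(lambda: x + y)

# Effective bandwidth: the operation reads x and y and writes one result tensor.
gbps = 3 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)
print(f"vector add: {ms:.4f} ms, {gbps:.1f} GB/s")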
Step 5: Use PyTorch with torch.compile to generate and run a fused Triton kernel
In the previous step we executed a hand-written Triton kernel. In this step, we illustrate the use of PyTorch's code generation capabilities to automatically generate an optimized kernel that represents a fusion of multiple PyTorch operations, which would otherwise require multiple separate GPU kernels if implemented in a naive or non-optimized manner. Refer to our previous blog post and the torch.compile documentation for information on kernel performance optimization techniques and kernel fusion, as well as PyTorch's ability to automatically generate Triton kernels.
To see this in our current environment, we can run the provided sample program, examples/torch-compile.py, from our repo. Here you can see that the target operation in the function foo() comprises three logically separate operations (a point-wise sin(), a point-wise cos() and a tensor addition), which naively would mean three separate kernels, with significant performance penalties for writing interim results back to global memory between each operation. However, by passing this function through torch.compile(), we trigger PyTorch to generate an optimized, fused kernel that combines all three operations and applies any other optimizations it finds feasible. You can view this interim generated code and the optimizations by running the program with the TORCH_COMPILE_DEBUG environment variable set to 1, as shown in the sample output below. When this is used, PyTorch emits a lot of debug information, including a copy of the auto-generated Triton kernel(s). In the example below, the output provides a path where files related to the various optimizations are saved; as can be seen, the generated Triton kernel lives in the output_code.py file and can be examined for analysis or used as a starting point for a separate hand-coded Triton kernel. You will need to use the path printed in your own test run.
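Conceptually, the sample program does something like the following minimal sketch; the exact function body and tensor shapes in examples/torch-compile.py may differ:

import torch

def foo(x, y):
    # Three logically separate operations: point-wise sin, point-wise cos, and an add.
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

# torch.compile traces foo and (via TorchInductor) emits a fused Triton kernel
# covering all three operations instead of launching one kernel per operation.
compiled_foo = torch.compile(foo)

x = torch.randn(256, 256, device="cuda")
y = torch.randn(256, 256, device="cuda")
print(compiled_foo(x, y))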
Here are the commands that you will use:
TORCH_COMPILE_DEBUG=1 python torch-compile.py
and
cat /PATH/TO/output_code.py | grep "triton.jit" -A 3
And here’s what the output should look like:
[root@1d47d4c3fcdd examples]# TORCH_COMPILE_DEBUG=1 python torch-compile.py
W1217 06:50:59.322000 416 torch/_inductor/debug.py:434] [0/0] model__0_inference_0 debug trace: /workspace/examples/torch_compile_debug/run_2024_12_17_06_50_57_987701-pid_416/torchinductor/model__0_inference_0.0
tensor([[-0.6096,  0.9892,  0.8039,  ...,  0.3115, -0.5144, -0.0134],
        [ 1.3055,  1.5004,  1.1269,  ...,  0.5491,  0.7383,  0.4786],
        ...,
        [-0.0205,  0.5738,  1.2120,  ...,  0.7075,  1.5020,  0.6215]], device='cuda:0')
[root@1d47d4c3fcdd examples]# ls /workspace/examples/torch_compile_debug/run_2024_12_17_06_50_57_987701-pid_416/torchinductor/model__0_inference_0.0
fx_graph_readable.py     fx_graph_transformed.py  ir_pre_fusion.txt
fx_graph_runnable.py     ir_post_fusion.txt       output_code.py
[root@1d47d4c3fcdd examples]# cat /workspace/examples/torch_compile_debug/run_2024_12_17_06_50_57_987701-pid_416/torchinductor/model__0_inference_0.0/output_code.py | grep "triton.jit" -A 3
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 65536
    xoffset = tl.program_id(0) * XBLOCK
Conclusion
In this article we have demonstrated how you can get started with PyTorch and Triton kernel development, testing and analysis in an environment that is based on Red Hat containers (using Red Hat UBI images) and the AMD Instinct family of AI accelerators. You are encouraged to leverage and build upon the sample code provided to begin your own journey with these important technologies. Good luck!
Acknowledgements
Thanks to Joseph Groenenboom for support as we put together this article, and to Nikhil Palaskar for loaning us the hardware setup.
¹ AMD Instinct is a trademark of Advanced Micro Devices Corporation.