Starting about 5 years ago, I began moving to container-based operating systems (OSes). It started with Bazzite, and most recently I have been using Aurora. What’s not to love? These OSes make containers first-class citizens, simplifying how to “install” and run applications, making experimentation and cleanup easy, and letting you use the great work of others as a starting point for your own work.
I have been using containers since before Docker; an early version of Red Hat OpenShift used Red Hat’s own container format called Gears. Then I spent four years working with and teaching people about Docker/OCI containers as OpenShift moved to Kubernetes.
Recently, I have returned more intensely to my scientist roots and have dived back into data analysis, statistics, machine learning, and artificial intelligence (AI). My work has been done primarily on Linux, but not always with containers. I put together several devcontainers/GitHub Codespaces for teaching workshops. I would also occasionally use Distrobox to isolate an environment when I didn’t want npm sprawled all over my machine.
Since I returned to Red Hat, I have been trying to move full-time to a containerized OS and shift my workflows to be entirely container-based. The process has been “interesting.” While trying to work through getting PyTorch in containers on a containerized OS with NVIDIA hardware, I think I alienated the Universal Blue AI Discord channel. Even with my experience in containers, Linux, and data science, the process was not straightforward and was quite frustrating. I have recently reached a clearer understanding, and I want to share it in the hopes of sparing others this “that which does not kill you makes you stronger” experience.
Note: Red Hat’s Emerging Technologies blog includes posts that discuss technologies that are under active development in upstream open source communities and at Red Hat. We believe in sharing early and often the things we’re working on, but we want to note that unless otherwise stated the technologies and how-tos shared here aren’t part of supported products, nor promised to be in the future.
How the pieces fit together
To start, let’s lay out the components. At the base level, you have your hardware, for this post, an NVIDIA GPU. On top of this, you are running a containerized OS that contains the NVIDIA Driver. Then you have your container runtime—either Docker or, my choice, Podman. Next is the NVIDIA Container Toolkit. Then in your container you have the CUDA Toolkit, which contains the libraries that make NVIDIA GPUs so good for AI/ML software. Finally, within the container, you have your PyTorch libraries and code.
A new term for you in the last paragraph might be the NVIDIA Container Toolkit. This is a crucial utility that allows applications inside your container to talk to the NVIDIA driver on your host OS. It provides the necessary runtime and libraries to bridge the gap between the container and the host’s NVIDIA driver. It is installed on your host system. Think of it as the enabler that makes the GPU visible and usable by the container. You only need to have the NVIDIA driver and the NVIDIA Container Toolkit installed on your host machine, not the full CUDA Toolkit.
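As a quick sanity check on the host, you can confirm the toolkit is present and, if your Podman setup uses the Container Device Interface (CDI), generate the device specification it needs. This is a minimal sketch; exact package names, output paths, and whether the toolkit is already baked into your image-based OS vary by distribution:

# Is the Container Toolkit CLI installed on the host?
nvidia-ctk --version
# If you are using CDI with Podman, generate the device spec (the output path may differ on your system)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# List the GPU devices CDI now knows about
nvidia-ctk cdi list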
Once you have the Container Toolkit installed, you can expose the GPUs to the container using the following syntax in your Podman (or Docker) command:
podman run --rm -it --gpus all yourorg/yourcontainer nvidia-smi
This exposes all NVIDIA GPUs on your computer to the container. The Container Toolkit gives you all sorts of command line flags for how you expose your GPUs to the container.
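For example, if you generated a CDI specification on the host (see the sketch above), you can address individual GPUs by index instead of exposing them all. This is a sketch of the CDI-style --device syntax the Container Toolkit documents for Podman:

# Expose every GPU via CDI
podman run --rm -it --device nvidia.com/gpu=all yourorg/yourcontainer nvidia-smi
# Expose only the first GPU
podman run --rm -it --device nvidia.com/gpu=0 yourorg/yourcontainer nvidia-smi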
What does it look like?
Here is a diagram of how all the pieces fit together, starting with the hardware on the bottom and your application code on top:
┌──────────────────────────────────────────────────┐
│               Your PyTorch Project               │
├──────────────────────────────────────────────────┤
│                     PyTorch                      │
├──────────────────────────────────────────────────┤
│                CUDA Toolkit (SDK)                │  <-- Inside the Container
├──────────────────────────────────────────────────┤
│               Container's Base OS                │
└┬────────────────────────────────────────────────┬┘
 │        NVIDIA Container Toolkit Bridge         │
┌┴────────────────────────────────────────────────┴┐
│        Container Runtime (Podman/Docker)         │
├──────────────────────────────────────────────────┤
│             NVIDIA Driver + Host OS              │  <-- On the Host Machine
├──────────────────────────────────────────────────┤
│               NVIDIA GPU Hardware                │
└──────────────────────────────────────────────────┘
Of course Distrobox has a slight wrinkle…
If you use Distrobox to spin up the containers you work in, you need to be aware of one slight problem. In an effort to make things easier for the end user, Distrobox added a --nvidia flag to its create command. What this flag actually does is bind-mount NVIDIA drivers and libraries from the host (/usr/lib64/nvidia, /usr/bin/nvidia-*, etc.) into standard container paths. In this case, you are actually using the drivers from the host. While this usually works, it has several potential problems:
- It’s a simple find-and-mount approach, not sophisticated device detection
- This can cause conflicts because it mounts ANY file containing “nvidia” in the name, even unrelated ones
- It “relies on Filesystem Hierarchy Standard (FHS) paths to detect Nvidia files and libraries”, which breaks on non-FHS systems like NixOS
A safer and more standard option is to install the NVIDIA Container Toolkit on your host OS and use it with Distrobox to expose the GPUs. Distrobox has an --additional-flags option that lets you pass flags through to the underlying container runtime. This is the more reliable and standard way to expose your GPU to your containers:
distrobox create --name my-distrobox --image yourorg/yourcontainer --additional-flags "--gpus all"
Refer to the section above for the different options you can pass with --gpus.
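If your host is set up for CDI, the same idea works by passing the CDI device flag through instead. This is just a hedged variant of the command above, assuming the CDI spec has been generated on the host:

distrobox create --name my-distrobox --image yourorg/yourcontainer --additional-flags "--device nvidia.com/gpu=all"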
A critical clarification: nvidia-smi and the CUDA version
There is a common point of confusion that really hindered my understanding of how things were wired together. When you run the nvidia-smi command, you’ll see a CUDA version listed in the top right corner.
+-------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02      Driver Version: 570.153.02      CUDA Version: 12.8  |
This does not mean that the CUDA Toolkit is installed in either your host or your container. It only indicates the maximum version of CUDA that the installed driver supports. It’s a statement of capability, not of what’s actually installed.
The only reliable way to know if the full CUDA development toolkit is installed in your environment is to run the following command:
nvcc --version
If it prints a version number, the toolkit is installed. If you get command not found, it is not.
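To see the distinction in practice, you can compare what the driver claims it supports with what is actually installed in your environment. A minimal sketch (the last line assumes PyTorch is already installed):

# What the driver reports (a statement of capability)
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# Whether the full toolkit (the compiler) is actually present
command -v nvcc && nvcc --version || echo "No CUDA Toolkit in this environment"
# Which CUDA runtime PyTorch itself was built against, and whether it can see the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"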
Which container to use as your base
There is an important distinction to make in your PyTorch work. Some projects you work on will never need to compile new code, such as those that just consume a pre-trained Transformer model from Hugging Face. Other projects, such as vLLM or those using DeepSpeed, heavily leverage compiling new code against the CUDA libraries on the fly. There will be more discussion of the compiling use case later, but these two use cases form a critical fork in deciding which container you can use as the base for your PyTorch work.
The easy case: No compilation needed
This scenario is straightforward and covers a surprising number of use cases. If your project only uses standard PyTorch functions and pre-trained models without needing to compile custom code, you can use almost any base container. The reason this works is that modern PyTorch “wheels” (the binary packages you install with pip) come bundled with the essential CUDA runtime libraries they need to function.
Here’s how you’d set this up using Distrobox:
- Create a standard Distrobox container with GPU access:
distrobox create --nvidia -n simple-project --image registry.fedoraproject.org/fedora-toolbox:latest
- Enter the container and install your tools: This command explicitly tells pip to download PyTorch and its dependencies from the repository containing builds for CUDA 12.8, which is the CUDA version compatible with PyTorch 2.7.
distrobox enter simple-project
python -m venv .venv
source .venv/bin/activate
pip3 install torch==2.7 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# or whichever version of PyTorch is referenced in your requirements.txt. Be sure to pick an index URL
# that is available for your PyTorch version: https://pytorch.org/get-started/previous-versions/
pip install transformers  # (or any other dependencies)
- Run your code: You can now run a simple script that downloads and uses a model, and it will work because the bundled libraries are sufficient.
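To convince yourself that the bundled runtime libraries are doing the work, a quick check like this inside the container is enough. The sentiment-analysis pipeline below is just an illustrative example that downloads a small default model from Hugging Face:

# The CUDA runtime pieces arrive as pip packages alongside torch (exact package names vary by version)
pip list | grep -i nvidia
# PyTorch can see the GPU even though no CUDA Toolkit is installed in the container
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
# A pre-trained model runs fine on top of the bundled libraries
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis', device=0)('Containers and GPUs are getting along nicely.'))"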
The not-so-easy case: Your project needs to compile
This is where things get tricky. The moment your project needs to build something new, the bundled libraries aren’t enough—it needs the full CUDA “toolbox,” not just the runtime libraries.
1. What kinds of projects need to be compiled? Any project that has a dependency that performs Just-In-Time (JIT) compilation to create optimized code for your specific GPU will require the full CUDA Toolkit.
- Examples: Projects using libraries like vLLM, DeepSpeed, CuPy, or custom CUDA kernels written in C++.
- Things to look for: Check your project’s requirements.txt. If you see packages like triton, xformers, flash-attn, or paged-attention, it’s a huge red flag that you’ll need a full development environment.
2. Start with an NVIDIA-provided container. This is the most crucial step. Instead of a generic container, use one of NVIDIA’s official PyTorch containers from their NGC registry. These are purpose-built with a matching CUDA Toolkit and PyTorch version installed and validated. Each release publishes which versions of PyTorch and other libraries are included in the image.
Note: there is no correspondence between the container image tag and either the PyTorch version or the CUDA libraries contained in the image. The only way to determine which container you want to pull is to look at either the release notes page or the Frameworks Container Support Matrix.
distrobox create --nvidia -i nvcr.io/nvidia/pytorch:25.04-py3 -n advanced-project
distrobox enter advanced-project
# Clone your project and install the modified dependencies
git clone <your-project-repo>
cd <your-project>
3. Modify your project’s requirements. Since the container already provides a known-good version of PyTorch, you must prevent pip from overwriting it with a potentially incompatible version from the public index.
- Why? The default PyTorch on PyPI is built for a different CUDA version than the one inside the NVIDIA container. Installing it will create library conflicts and “symbol mismatch” errors.
- How? Create a new requirements file that excludes torch and related packages.
grep -vE 'torch|xformers|deepspeed' requirements.txt > requirements_modified.txt
4. Run your project in the new environment.
pip install -r requirements_modified.txt
# Now, run your application
python main.py
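Finally, it is worth verifying that the container’s pre-installed PyTorch is still the one in play and that the full toolkit is present. A minimal sketch (the versions printed will depend on which NGC tag you pulled):

# Confirm the image's PyTorch was not replaced by pip, and that it still sees the GPU
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# The full CUDA Toolkit ships in the NGC image, so the compiler should be present
nvcc --version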
Demystifying the hurdles: Key concepts explained
Now that we have been through the whole flow once, let’s revisit some of the key highlights. Here are the core concepts that, once understood, make everything clearer.
- Driver vs. Toolkit: The NVIDIA Driver (on your host) is the low-level software that lets your OS talk to the GPU. The CUDA Toolkit (inside your container) is the high-level Software Development Kit (SDK) with compilers and libraries that PyTorch needs to build and run complex code. The --nvidia flag in Distrobox or the --gpus all flag only provides the driver; you are responsible for providing the toolkit.
- Why Simple Projects “Just Work”: The default pip install torch command downloads a PyTorch version with its own tiny, bundled set of essential CUDA runtime libraries. This is enough for running pre-compiled models but is not a full development environment.
- The Pitfalls of Manual Installation: Trying to install the CUDA Toolkit into a generic container with a package manager (dnf in Fedora) can be very fragile. You can run into disabled repositories, missing repository URLs for new OS versions, and finally, a fundamental file conflict where the package manager tries to overwrite files that the NVIDIA Container Toolkit has already mounted from the host.
Just give me one solution
If you read all this and you are wondering what I recommend, let me lay out the most reliable and simplest solution, the one that will work in the widest range of use cases:
1. Use the NVIDIA NGC Containers. They are the most reliable solution.
- No Conflicts: You get a pre-built, pre-validated stack, where you are guaranteed that the PyTorch version matches the CUDA Toolkit version.
- Fine-Grained Control: You can pick a container with the exact versions of PyTorch, CUDA, and cuDNN that your project needs.
- Performance: More of the underlying stack is optimized to take advantage of CUDA, often leading to better performance.
2. Use a Devcontainer for a Better Workflow. Unless you really want to work exclusively in the terminal, use a Development Container (devcontainer) rather than Distrobox. A devcontainer automates the entire setup and provides a much better development experience. For example, both VS Code and JetBrains IDEs work natively with devcontainers.
- Why? It provides a reproducible, version-controlled environment that you can launch with a single click in VS Code or JetBrains PyCharm. It’s the best way to package and share your development environment with others.
- Example devcontainer.json: This file, placed in a .devcontainer directory in your project, sets up the entire environment automatically.
{
"name": "PyTorch-2-7-CUDA-Dev",
"image": "nvcr.io/nvidia/pytorch:25.04-py3",
// Arguments for the container runtime
"runArgs": ["--gpus=all"],
// VS Code specific settings
"customizations": {
"vscode": {
"extensions": [
"ms-python.python",
"ms-python.vscode-pylance",
"ms-toolsai.jupyter"
]
}
},
// Command to run after the container is created
"postCreateCommand": "pip install -r requirements_modified.txt"
}
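If you prefer the terminal after all, the same file also works with the Dev Containers CLI (installable with npm install -g @devcontainers/cli); the commands below are a sketch of that workflow, while in VS Code you would simply use “Reopen in Container”:

# Build and start the devcontainer defined in .devcontainer/
devcontainer up --workspace-folder .
# Run a command inside it
devcontainer exec --workspace-folder . python main.py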
Wrap up: Welcome to this brave new world
The key takeaway is this: you must distinguish between projects that only run pre-compiled code and those that need to compile new code, and then choose your container environment accordingly.
Moving your AI/ML development workflows into containers provides incredible benefits in reproducibility and portability, allowing you to build, test, and share complex environments with confidence.
I hope this guide helps you on your way. I’m looking forward to any feedback and seeing what you build!

