GPU Doctor

A curated catalog of ML Docker base images to help engineers escape "CUDA hell" and choose the right container for their GPU workloads.

The Problem

Selecting a Docker base image for ML is painful:

CUDA/cuDNN version matrices are confusing
Image sizes range from 1 GB to over 20 GB, with unclear trade-offs
Documentation is scattered across NVIDIA, cloud providers, and framework maintainers
Security scan data is hard to find

The Solution

GPU Doctor provides:

Structured catalog of vetted ML images (PyTorch, TensorFlow, vLLM, TGI, Triton, etc.)
Consistent metadata including CUDA versions, driver requirements, sizes, and security ratings
Guided picker to find the right image for your use case
Searchable table with filtering by framework, role, and cloud affinity

Quick Start

# Browse the catalog
jq '.images[] | {id, name, role: .capabilities.role}' data/images.json

# Validate the catalog against the schema after edits (needs jsonschema installed in .venv)
.venv/bin/python -c "
import json
from jsonschema import validate
with open('data/schema.json') as f: schema = json.load(f)
with open('data/images.json') as f: data = json.load(f)
validate(instance=data, schema=schema)
print('Valid')
"

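The jq query above can also be expressed in Python when a filtered view is more convenient. This is a minimal sketch that assumes only what the jq command shows: a top-level images array whose entries carry id, name, and capabilities.role. The helper name images_with_role and the role value "training" are hypothetical, chosen for illustration.

```python
import json
from pathlib import Path


def images_with_role(catalog, role):
    """Return {id, name, role} summaries for images whose capabilities.role equals `role`."""
    matches = []
    for image in catalog.get("images", []):
        # Entries without a capabilities block are skipped rather than raising.
        image_role = image.get("capabilities", {}).get("role")
        if image_role == role:
            matches.append(
                {"id": image.get("id"), "name": image.get("name"), "role": image_role}
            )
    return matches


if __name__ == "__main__":
    path = Path("data/images.json")
    if path.exists():
        catalog = json.loads(path.read_text())
        # "training" is an assumed role value; check the schema for the real ones.
        for row in images_with_role(catalog, "training"):
            print(row["id"], row["name"])
```

The valid role values are defined by data/schema.json, so check it before relying on any particular string.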
Project Status

JSON Schema design
Comprehensive catalog (200+ images covering PyTorch, TensorFlow, vLLM, TGI, Ollama, CUDA bases, NGC images for JAX, NeMo, Triton, and more)
Website with guided picker (5-step wizard) and table view
CI automation for catalog updates from registries
Security scan enrichment (Trivy integration)
Claude skill for recommendations

Catalog Automation

The catalog is updated automatically from container registries. To run the update locally:

# Install dependencies
pip install -r scripts/requirements.txt

# Dry-run (preview changes without writing)
python scripts/update_catalog.py --dry-run --source all

# Update from all registries (Docker Hub, GHCR, NGC)
python scripts/update_catalog.py --source all

# Update from a specific registry
python scripts/update_catalog.py --source dockerhub

How it works:

data/tracked_images.yaml defines which images to track
The script fetches metadata from registries, parses tags, and builds catalog entries
Curated fields (notes, recommended_for, etc.) are preserved during updates
A GitHub Actions workflow runs weekly and creates PRs with updates
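The tag-parsing and curated-field steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/update_catalog.py: the function names, the cuda_version key, and the regular expression are hypothetical, and the list of curated fields contains only the two the text names.

```python
import re
from typing import Optional

# Hand-maintained fields. The text names notes and recommended_for;
# the real script may protect more fields than these.
CURATED_FIELDS = ("notes", "recommended_for")

# Matches a CUDA version embedded in a tag, e.g. "cuda12.1" or "cuda-12.1".
CUDA_RE = re.compile(r"cuda-?(\d+)\.(\d+)", re.IGNORECASE)


def parse_cuda_version(tag: str) -> Optional[str]:
    """Extract "major.minor" from an image tag, or None if the tag has no CUDA version."""
    match = CUDA_RE.search(tag)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def merge_entry(existing: Optional[dict], fetched: dict) -> dict:
    """Start from freshly fetched metadata, then copy curated fields back from the old entry."""
    merged = dict(fetched)
    if existing:
        for field in CURATED_FIELDS:
            if field in existing:
                merged[field] = existing[field]
    return merged


if __name__ == "__main__":
    tag = "2.1.0-cuda12.1-cudnn8-runtime"
    old = {"id": "example", "notes": "Good default for training", "cuda_version": "11.8"}
    new = {"id": "example", "cuda_version": parse_cuda_version(tag)}
    print(merge_entry(old, new))
```

Building the merged entry from the fetched data and copying only the curated keys back means registry-derived values are always refreshed, while hand-written annotations survive each run.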

Audit / Security scans

To run lightweight security scans against the catalog:

python scripts/audit_catalog.py --dry-run --mode security --max-images 5

Before scanning images from public.ecr.aws, authenticate once per shell session:

aws ecr-public get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin public.ecr.aws

Alternatively, set TRIVY_REGISTRY_TOKEN to the password printed by aws ecr-public get-login-password so that Trivy can authenticate without an explicit docker login.

Scan artifacts and the scan cache are stored under data/.audit, which keeps reruns fast.
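To give a sense of what a scan produces, the sketch below tallies vulnerabilities by severity from a Trivy JSON report (as written by trivy image --format json). It assumes the report's Results list with per-target Vulnerabilities entries; the function name count_severities is hypothetical, and this is not necessarily how scripts/audit_catalog.py derives its security ratings.

```python
from collections import Counter


def count_severities(report: dict) -> Counter:
    """Count vulnerabilities per severity level across all targets in a Trivy JSON report."""
    counts = Counter()
    # "Results" or "Vulnerabilities" may be absent or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # A tiny hand-written report for illustration, not real scan output.
    report = {
        "Results": [
            {"Target": "os-packages", "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ]},
            {"Target": "python-packages"},
        ]
    }
    print(dict(count_severities(report)))
```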

Website

The website is a Next.js app in web/:

cd web
yarn install
yarn dev

Pages:

/guide - 5-step wizard to find the right image based on workload, environment, frameworks, and priorities
/table - Searchable/filterable table of all images
/images/[id] - Detailed view of a single image with specs, security info, and quick-start commands
