Conversation

@savely-krasovsky
Contributor

@savely-krasovsky savely-krasovsky commented Nov 1, 2025

Description

I noticed that Intel Battlemage GPUs are not supported by Immich, which means that, unfortunately, owners of the B570, B580, Arc Pro B50, and B60 cannot use GPU acceleration.

Since the B50 debuted only in Sep 2025, I had to update the OpenVINO image from Bookworm to Trixie (otherwise the .deb packages from intel-graphics-compiler and compute-runtime wouldn't install) and update Python from 3.11 to 3.13. This allowed me to use the newest onnx, onnxruntime-gpu, and onnxruntime-openvino with Arc Pro support.
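As a quick sanity check that the updated onnxruntime-openvino wheel actually exposes the OpenVINO execution provider inside the new Trixie-based image, something like this can be run (a sketch, not Immich code; it degrades to CPU when onnxruntime isn't importable):

```python
# Sketch: confirm the OpenVINO execution provider is available after the
# dependency bump, falling back to CPU otherwise. Not Immich's actual code.
def pick_providers():
    try:
        import onnxruntime as ort
        available = ort.get_available_providers()
    except ImportError:  # wheel not installed; assume CPU only
        available = ["CPUExecutionProvider"]
    preferred = ["OpenVINOExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available] or ["CPUExecutionProvider"]

print(pick_providers())
```

On a correctly built image the list should start with OpenVINOExecutionProvider; on a plain CPU host it contains only the CPU provider.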

Due to that dependency bump, I also had to refresh the ROCm stack.

Fixes #21190.

How Has This Been Tested?

I tested it on my personal setup with an Arc Pro B50; unfortunately, I don't have other hardware to test for regressions with CUDA/ROCm and other providers.

  • Tested various ML scenarios with Arc Pro B50 GPU.
  • Tested Nvidia/AMD and other SoCs.

Screenshots (if appropriate)

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if applicable
  • I have no unrelated changes in the PR.
  • I have confirmed that any new dependencies are strictly necessary.
  • I have written tests for new code (if applicable)
  • I have followed naming conventions/patterns in the surrounding code
  • All code in src/services/ uses repositories implementations for database calls, filesystem operations, etc.
  • All code in src/repositories/ is pretty basic/simple and does not have any immich specific logic (that belongs in src/services/)

Please describe to which degree, if any, an LLM was used in creating this pull request.

Was not used at all.

@savely-krasovsky savely-krasovsky changed the title from "[WIP] feat: update onnxruntime and openvino stack" to "feat: update onnxruntime and openvino stack to support Intel Battlemage GPUs" on Nov 1, 2025
@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 1, 2025

@mertalev I noticed strange behavior while testing this PR: the GPU works fine for 5-10 minutes, and then immich-machine-learning starts to spam this into the logs and the GPU fans go brrr. I only tested it with OCR, though, so I'm not sure whether it's related to my PR. If I understand correctly, the GPU cannot do multiple tasks at the same time, but the machine-learning service keeps feeding it anyway.

I can open a separate issue if necessary.

@mertalev
Member

mertalev commented Nov 1, 2025

It should be able to do multiple tasks at the same time. I haven't seen this before, so I think it's likely related to the OpenCL driver change in this PR. Not sure if this is a known upstream issue or what the root cause is.
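If the service really were outrunning the GPU, one generic mitigation would be to bound in-flight inference with a semaphore. A minimal sketch of that idea (not Immich's actual scheduling; `run_bounded` and `fake_infer` are hypothetical):

```python
import asyncio

# Sketch: cap how many inference requests hit the GPU at once, so a
# bursty queue cannot pile unbounded work onto the device.
async def run_bounded(coros, limit=2):
    sem = asyncio.Semaphore(limit)

    async def worker(coro):
        async with sem:  # at most `limit` coroutines run concurrently
            return await coro

    return await asyncio.gather(*(worker(c) for c in coros))

async def fake_infer(i):
    await asyncio.sleep(0)  # stand-in for a real GPU inference call
    return i * 2

results = asyncio.run(run_bounded([fake_infer(i) for i in range(4)]))
print(results)  # [0, 2, 4, 6]
```

`asyncio.gather` preserves submission order, so results line up with the inputs even though execution is throttled.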

@savely-krasovsky
Contributor Author

@mertalev okay, I will do some research over the weekend.

@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 1, 2025

But I would test it anyway, especially on older hardware (even the B570/B580 could be considered older, since I use the Arc Pro B50). For now I believe it's some driver bug, since the GPU is basically brand new.

@savely-krasovsky savely-krasovsky changed the title from "feat: update onnxruntime and openvino stack to support Intel Battlemage GPUs" to "feat(ml): update onnxruntime and openvino stack to support Intel Battlemage GPUs" on Nov 1, 2025
@savely-krasovsky savely-krasovsky force-pushed the main branch 2 times, most recently from 393583c to 3ecc0df on November 1, 2025 15:04
@savely-krasovsky
Contributor Author

Because I bumped numpy to the latest version (due to the onnxruntime update), the ROCm build became incompatible: numpy now requires Python 3.12, but the ROCm build is based on Ubuntu 22.04, which only has Python 3.10. So I decided to bump the ROCm base to Ubuntu 24.04.

After a lot of trial and error, I finally fixed the ROCm build by bumping both the base to Ubuntu 24.04 and its ONNX Runtime to 1.21.1, the latest version that still supports the ROCm Execution Provider. In later releases it's removed, and they advise migrating to the MIGraphX EP.

I also had to add cache support, since otherwise I would have been debugging those failing builds for God knows how long, lol. Finally, I removed the patch that sets a different CMAKE_HIP_ARCHITECTURES and moved it to the Dockerfile; now it's much simpler to control.
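The base-image reasoning above boils down to a simple interpreter-version check. A sketch (`numpy_compatible` is a hypothetical helper; the 3.12 floor is taken from the comment, not verified against numpy's metadata):

```python
import sys

# Sketch of the constraint described above: per the comment, current numpy
# requires Python >= 3.12, so Ubuntu 22.04 (Python 3.10) cannot host it,
# while Ubuntu 24.04 (Python 3.12) can.
def numpy_compatible(py_version=sys.version_info, required=(3, 12)):
    return tuple(py_version[:2]) >= required

print(numpy_compatible((3, 10)))  # False -> Ubuntu 22.04 base fails
print(numpy_compatible((3, 12)))  # True  -> Ubuntu 24.04 base works
```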

@savely-krasovsky savely-krasovsky marked this pull request as ready for review November 2, 2025 10:03
@savely-krasovsky savely-krasovsky changed the title from "feat(ml): update onnxruntime and openvino stack to support Intel Battlemage GPUs" to "feat(ml): update ONNX Runtime, OpenVINO and ROCm stack" on Nov 2, 2025
@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 2, 2025

GPU works fine for 5-10 minutes and then immich-machine-learning starts to spam this into logs and GPU fans go brrr.

I believe it's some driver issue in the end; setting MACHINE_LEARNING_MAX_BATCH_SIZE__TEXT_RECOGNITION=3 helps to partly mitigate the issue. The VRAM OOM causes some kind of driver resets, idk.
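For context on that variable name: the double underscore acts as a nesting delimiter, mapping the env var to a per-task batch cap. A rough sketch of that mapping (`read_batch_caps` is a hypothetical helper, not Immich's actual settings parser):

```python
# Sketch: parse env vars of the form
#   MACHINE_LEARNING_MAX_BATCH_SIZE__<TASK>=<int>
# into a {task: cap} mapping, where "__" separates setting from task.
def read_batch_caps(environ, prefix="MACHINE_LEARNING_MAX_BATCH_SIZE__"):
    caps = {}
    for key, value in environ.items():
        if key.startswith(prefix):
            task = key[len(prefix):].lower()
            caps[task] = int(value)
    return caps

env = {"MACHINE_LEARNING_MAX_BATCH_SIZE__TEXT_RECOGNITION": "3"}
print(read_batch_caps(env))  # {'text_recognition': 3}
```

With the cap in place, the OCR job submits at most three images per batch, which keeps peak VRAM lower at the cost of throughput.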

@mertalev
Member

mertalev commented Nov 4, 2025

Testing on a 155H processor looks promising so far on CPU and OpenVINO. I haven't encountered the error you mentioned after many runs with different settings, though this test library only has a dozen or so assets. Will do more testing with a larger library later, as well as on ROCm.

@savely-krasovsky
Contributor Author

@mertalev with Arc Pro B50, it straight up leaks GPU memory from the start of the job, and after about 2–3 thousand pictures processed, it reaches the VRAM limit (16 GB in my case) and throws a CL_OUT_OF_RESOURCES exception with the following description:

Due to a driver bug, any subsequent OpenCL API call may cause the application to hang, so the GPU plugin may be unable to finish correctly.
The CL_OUT_OF_RESOURCES error typically occurs in two cases:

  1. An actual lack of memory for the current inference.
  2. An out-of-bounds access to GPU memory from a kernel.

For case 1, you may try adjusting some model parameters (e.g., use a smaller batch size, lower inference precision, fewer streams, etc.) to reduce the required memory size. For case 2, please submit a bug report to the OpenVINO team. Additionally, please try updating the driver to the latest version.

This means the problem is in compute-runtime or the xe kernel driver (there is no option to fall back to i915, since it doesn't support Battlemage GPUs).

