Conversation

@savely-krasovsky
Contributor

@savely-krasovsky savely-krasovsky commented Nov 1, 2025

Description

I noticed that Intel Battlemage GPUs are not supported by Immich, which means that, unfortunately, owners of the B570, B580, Arc Pro B50, and B60 cannot use GPU acceleration.

Since the B50 debuted only in Sep 2025, I had to update the OpenVINO image from Bookworm to Trixie (otherwise the .deb packages from intel-graphics-compiler and compute-runtime wouldn't install) and update Python from 3.11 to 3.13. This allowed me to use the newest onnx, onnxruntime-gpu, and onnxruntime-openvino with Arc Pro support.
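As a quick sanity check that the updated onnxruntime-openvino wheel actually exposes the OpenVINO execution provider inside the new Trixie-based image, something like this can be run (a sketch, not Immich code; it degrades to CPU when onnxruntime isn't importable):

```python
# Sketch: confirm the OpenVINO execution provider is available after the
# dependency bump, falling back to CPU otherwise. Not Immich's actual code.
def pick_providers():
    try:
        import onnxruntime as ort
        available = ort.get_available_providers()
    except ImportError:  # wheel not installed; assume CPU only
        available = ["CPUExecutionProvider"]
    preferred = ["OpenVINOExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available] or ["CPUExecutionProvider"]

print(pick_providers())
```

On a correctly built image the list should start with OpenVINOExecutionProvider; on a plain CPU host it contains only the CPU provider.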

Due to that dependency bump, I also had to refresh the ROCm stack.

Fixes #21190.

How Has This Been Tested?

I tested it on my personal setup with an Arc Pro B50; unfortunately, I don't have other hardware to test for regressions with CUDA/ROCm and other providers.

  • Tested various ML scenarios with Arc Pro B50 GPU.
  • Tested Nvidia/AMD and other SoCs.

Screenshots (if appropriate)

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if applicable
  • I have no unrelated changes in the PR.
  • I have confirmed that any new dependencies are strictly necessary.
  • I have written tests for new code (if applicable)
  • I have followed naming conventions/patterns in the surrounding code
  • All code in src/services/ uses repositories implementations for database calls, filesystem operations, etc.
  • All code in src/repositories/ is pretty basic/simple and does not have any immich specific logic (that belongs in src/services/)

Please describe to which degree, if any, an LLM was used in creating this pull request.

Was not used at all.

@savely-krasovsky savely-krasovsky changed the title from "[WIP] feat: update onnxruntime and openvino stack" to "feat: update onnxruntime and openvino stack to support Intel Battlemage GPUs" on Nov 1, 2025
@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 1, 2025

@mertalev I noticed strange behavior while testing this PR: the GPU works fine for 5-10 minutes, and then immich-machine-learning starts to spam this into the logs and the GPU fans go brrr. I only tested it with OCR, though, so I'm not sure whether it's related to my PR. If I understand correctly, the GPU cannot do multiple tasks at the same time, but the machine-learning service keeps feeding it anyway.

I can open a separate issue if necessary.

@mertalev
Member

mertalev commented Nov 1, 2025

It should be able to do multiple tasks at the same time. I haven't seen this before, so I think it's likely related to the OpenCL driver change in this PR. Not sure if this is a known upstream issue or what the root cause is.
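If the service really were outrunning the GPU, one generic mitigation would be to bound in-flight inference with a semaphore. A minimal sketch of that idea (not Immich's actual scheduling; `run_bounded` and `fake_infer` are hypothetical):

```python
import asyncio

# Sketch: cap how many inference requests hit the GPU at once, so a
# bursty queue cannot pile unbounded work onto the device.
async def run_bounded(coros, limit=2):
    sem = asyncio.Semaphore(limit)

    async def worker(coro):
        async with sem:  # at most `limit` coroutines run concurrently
            return await coro

    return await asyncio.gather(*(worker(c) for c in coros))

async def fake_infer(i):
    await asyncio.sleep(0)  # stand-in for a real GPU inference call
    return i * 2

results = asyncio.run(run_bounded([fake_infer(i) for i in range(4)]))
print(results)  # [0, 2, 4, 6]
```

`asyncio.gather` preserves submission order, so results line up with the inputs even though execution is throttled.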

@savely-krasovsky
Contributor Author

@mertalev okay, I will do some research over the weekend.

@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 1, 2025

But I would test it anyway, especially on older hardware (even the B570/B580 could be considered older, since I use the Arc Pro B50). For now I believe it's some driver bug, since the GPU is basically brand new.

@savely-krasovsky savely-krasovsky changed the title from "feat: update onnxruntime and openvino stack to support Intel Battlemage GPUs" to "feat(ml): update onnxruntime and openvino stack to support Intel Battlemage GPUs" on Nov 1, 2025
@savely-krasovsky savely-krasovsky force-pushed the main branch 2 times, most recently from 393583c to 3ecc0df on November 1, 2025 15:04
@savely-krasovsky
Contributor Author

Because I bumped numpy to the latest version (due to the onnxruntime update), the ROCm build became incompatible: numpy now requires Python 3.12, but the ROCm build is based on Ubuntu 22.04, which only has Python 3.10. So I decided to bump the ROCm base to Ubuntu 24.04.

After a lot of trial and error, I finally fixed the ROCm build by bumping both the base to Ubuntu 24.04 and its ONNX Runtime to 1.21.1, the latest version that still supports the ROCm Execution Provider. In later releases it's removed, and they advise migrating to the MIGraphX EP.

I also had to add cache support, since otherwise I would have been debugging those failing builds for God knows how long, lol. Finally, I removed the patch that sets a different CMAKE_HIP_ARCHITECTURES and moved it to the Dockerfile; now it's much simpler to control.
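The base-image reasoning above boils down to a simple interpreter-version check. A sketch (`numpy_compatible` is a hypothetical helper; the 3.12 floor is taken from the comment, not verified against numpy's metadata):

```python
import sys

# Sketch of the constraint described above: per the comment, current numpy
# requires Python >= 3.12, so Ubuntu 22.04 (Python 3.10) cannot host it,
# while Ubuntu 24.04 (Python 3.12) can.
def numpy_compatible(py_version=sys.version_info, required=(3, 12)):
    return tuple(py_version[:2]) >= required

print(numpy_compatible((3, 10)))  # False -> Ubuntu 22.04 base fails
print(numpy_compatible((3, 12)))  # True  -> Ubuntu 24.04 base works
```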

@savely-krasovsky savely-krasovsky marked this pull request as ready for review November 2, 2025 10:03
@savely-krasovsky savely-krasovsky changed the title from "feat(ml): update onnxruntime and openvino stack to support Intel Battlemage GPUs" to "feat(ml): update ONNX Runtime, OpenVINO and ROCm stack" on Nov 2, 2025
@savely-krasovsky
Contributor Author

savely-krasovsky commented Nov 2, 2025

GPU works fine for 5-10 minutes and then immich-machine-learning starts to spam this into logs and GPU fans go brrr.

I believe it's some driver issue in the end; setting MACHINE_LEARNING_MAX_BATCH_SIZE__TEXT_RECOGNITION=3 helps to partly mitigate the issue. The VRAM OOM causes some kind of driver resets, idk.
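For context on that variable name: the double underscore acts as a nesting delimiter, mapping the env var to a per-task batch cap. A rough sketch of that mapping (`read_batch_caps` is a hypothetical helper, not Immich's actual settings parser):

```python
# Sketch: parse env vars of the form
#   MACHINE_LEARNING_MAX_BATCH_SIZE__<TASK>=<int>
# into a {task: cap} mapping, where "__" separates setting from task.
def read_batch_caps(environ, prefix="MACHINE_LEARNING_MAX_BATCH_SIZE__"):
    caps = {}
    for key, value in environ.items():
        if key.startswith(prefix):
            task = key[len(prefix):].lower()
            caps[task] = int(value)
    return caps

env = {"MACHINE_LEARNING_MAX_BATCH_SIZE__TEXT_RECOGNITION": "3"}
print(read_batch_caps(env))  # {'text_recognition': 3}
```

With the cap in place, the OCR job submits at most three images per batch, which keeps peak VRAM lower at the cost of throughput.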

@mertalev
Member

mertalev commented Nov 4, 2025

Testing on a 155H processor looks promising so far on CPU and OpenVINO. I haven't encountered the error you mentioned after many runs with different settings, though this test library only has a dozen or so assets. Will do more testing with a larger library later, as well as on ROCm.

@savely-krasovsky
Contributor Author

@mertalev with Arc Pro B50, it straight up leaks GPU memory from the start of the job, and after about 2–3 thousand pictures processed, it reaches the VRAM limit (16 GB in my case) and throws a CL_OUT_OF_RESOURCES exception with the following description:

Due to a driver bug, any subsequent OpenCL API call may cause the application to hang, so the GPU plugin may be unable to finish correctly.
The CL_OUT_OF_RESOURCES error typically occurs in two cases:

  1. An actual lack of memory for the current inference.
  2. An out-of-bounds access to GPU memory from a kernel.

For case 1, you may try adjusting some model parameters (e.g., use a smaller batch size, lower inference precision, fewer streams, etc.) to reduce the required memory size. For case 2, please submit a bug report to the OpenVINO team. Additionally, please try updating the driver to the latest version.

This means the problem is in compute-runtime or the xe kernel driver (there is no option to fall back to i915, since it doesn't support Battlemage GPUs).

