Hi Team,

I've been trying to use the SGLang engine for inference on fine-tuned VLMs for entity extraction. I've noticed an accuracy gap when these models are served through SGLang versus the Hugging Face `transformers` library: for certain entities the difference is as much as 30%. I've been using the same test dataset in both cases, and the input arguments (`temperature`, `max_tokens`, `do_sample`, etc.) are identical. I'm running SGLang via a Docker image on RunPod serverless.
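For reference, this is roughly how I'm measuring the per-entity gap between the two backends (the entity names and the flat dict layout here are illustrative, not my actual schema):

```python
def per_entity_accuracy(predictions, ground_truth, entities):
    """For each entity, the fraction of examples where the predicted
    value exactly matches the ground-truth value."""
    scores = {}
    for entity in entities:
        correct = sum(
            1 for pred, gold in zip(predictions, ground_truth)
            if pred.get(entity) == gold.get(entity)
        )
        scores[entity] = correct / len(ground_truth)
    return scores

# Comparing the two backends on the same test set:
# hf_scores  = per_entity_accuracy(hf_outputs, labels, ENTITIES)
# sgl_scores = per_entity_accuracy(sglang_outputs, labels, ENTITIES)
# gaps = {e: hf_scores[e] - sgl_scores[e] for e in ENTITIES}
```

The ~30% figure above is this per-entity score, not an aggregate over all entities.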
This issue exists with both `Gemma3-12B` and `Qwen2.5-VL`. I've tried disabling `radix_cache` and `cuda_graph`, but there is no change in the output. I've also confirmed that the image is passed in the same way (the `ENABLE_MULTIMODAL` and `TRUST_REMOTE_CODE` flags are set to `True`).
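For completeness, this is the shape of the launch command I'm using (the model path is a placeholder; flag names are the SGLang server arguments as I understand them):

```shell
# Launch the SGLang server with radix cache and CUDA graphs disabled
python -m sglang.launch_server \
  --model-path <merged-model-on-hf> \
  --port 30000 \
  --trust-remote-code \
  --enable-multimodal \
  --disable-radix-cache \
  --disable-cuda-graph
```

On RunPod serverless the same flags are passed through the container's environment rather than the command line, but the effective configuration is the same.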
I've also tried splitting the entities I'm asking for into separate requests (initially I passed all 17 entities at once; now I pass them 4 at a time), but there is not much difference in accuracy (a 5-7% improvement at most).
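The splitting itself is trivial; each chunk of entity names goes into its own request with the same image and prompt template:

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 17 entities in groups of 4 -> 5 requests per document (4+4+4+4+1)
```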
I have also tried to bypass SGLang's engine and use the `transformers` backend directly by setting `MODEL_IMPL=transformers`, but this errors out for both Gemma and Qwen (`input_ids` are not passed during generation).
Is there something that I am missing here? I'm using the latest versions of SGLang and transformers.
Note:
For the fine-tuning I've been using LoRA-based SFT. Once training is complete, I merge the adapter weights into the base model and upload the full model to Hugging Face. I then download these weights and load them on a separate GPU using Hugging Face `transformers` for my local testing. I do not use SGLang here, and the accuracy results are very good. Once I run inference with the same model on SGLang, the issue above occurs.