Hi Team,

I've been trying to use the SGLang engine for inference on fine-tuned VLMs for entity extraction. I've noticed an accuracy gap when these models are served through SGLang versus the Hugging Face `transformers` library: for certain entities the difference is as much as 30%. I've been using the same test dataset in both cases, and the input arguments (`temperature`, `max_tokens`, `do_sample`, etc.) are identical. I'm running SGLang via a Docker image on RunPod serverless.
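For reference, this is roughly how I'm measuring the per-entity gap between the two backends (the entity names and the flat dict layout here are illustrative, not my actual schema):

```python
def per_entity_accuracy(predictions, ground_truth, entities):
    """For each entity, the fraction of examples where the predicted
    value exactly matches the ground-truth value."""
    scores = {}
    for entity in entities:
        correct = sum(
            1 for pred, gold in zip(predictions, ground_truth)
            if pred.get(entity) == gold.get(entity)
        )
        scores[entity] = correct / len(ground_truth)
    return scores

# Comparing the two backends on the same test set:
# hf_scores  = per_entity_accuracy(hf_outputs, labels, ENTITIES)
# sgl_scores = per_entity_accuracy(sglang_outputs, labels, ENTITIES)
# gaps = {e: hf_scores[e] - sgl_scores[e] for e in ENTITIES}
```

The ~30% figure above is this per-entity score, not an aggregate over all entities.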
This issue exists with both `Gemma3-12B` and `Qwen2.5-VL`. I've tried disabling `radix_cache` and `cuda_graph`, but there is no change in the output. I've also confirmed that the image is passed in the same way (the `ENABLE_MULTIMODAL` and `TRUST_REMOTE_CODE` flags are set to `True`).
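For completeness, this is the shape of the launch command I'm using (the model path is a placeholder; flag names are the SGLang server arguments as I understand them):

```shell
# Launch the SGLang server with radix cache and CUDA graphs disabled
python -m sglang.launch_server \
  --model-path <merged-model-on-hf> \
  --port 30000 \
  --trust-remote-code \
  --enable-multimodal \
  --disable-radix-cache \
  --disable-cuda-graph
```

On RunPod serverless the same flags are passed through the container's environment rather than the command line, but the effective configuration is the same.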
I've also tried splitting the entities I'm asking for into separate requests (initially I passed all 17 entities at once; now I pass them 4 at a time), but there is not much difference in accuracy (a 5-7% improvement at most).
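The splitting itself is trivial; each chunk of entity names goes into its own request with the same image and prompt template:

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 17 entities in groups of 4 -> 5 requests per document (4+4+4+4+1)
```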
I have also tried to bypass SGLang's engine and use the `transformers` backend directly by setting `MODEL_IMPL=transformers`, but this errors out for both Gemma and Qwen (`input_ids` are not passed during generation).
Is there something that I am missing here? I'm using the latest versions of SGLang and transformers.
Note:
For the fine-tuning I've been using LoRA-based SFT. Once training is complete, I merge the adapter weights into the base model and upload the full model to Hugging Face. I then download these weights and load them on a separate GPU using Hugging Face `transformers` for my local testing. I do not use SGLang here, and the accuracy results are very good. Once I run inference with the same model on SGLang, the issue above occurs.