Replies: 3 comments
With a CPU, slower inference time is expected, but you can enable parallel requests by setting the environment variable appropriately (see line 72 in 39a6b56). However, that would probably not work very well with CPUs.
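For illustration, a minimal sketch of what enabling this could look like when starting the container. The variable names (`PARALLEL_REQUESTS`, `LLAMACPP_PARALLEL`, `THREADS`) and the image tag are assumptions based on later LocalAI documentation, not on the line referenced above, so verify them against the docs for the version you actually run:

```bash
# Sketch only: start LocalAI with concurrent request handling enabled.
# PARALLEL_REQUESTS, LLAMACPP_PARALLEL, THREADS and the image tag are
# assumed names -- check the documentation of your LocalAI version.
docker run -d --name local-ai -p 8080:8080 \
  -v "$PWD/models:/models" \
  -e PARALLEL_REQUESTS=true \
  -e LLAMACPP_PARALLEL=2 \
  -e THREADS=14 \
  localai/localai:latest-cpu
```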
Hello @mudler, sorry to bring this up on an old thread, but it is not clear to me whether [...]. Thanks!
Hello,

So it seems that I cannot get parallel requests to work. The use case is to run two inference tasks in parallel (simulating, for example, two users asking to summarize a text from Nextcloud Assistant; see my previous comment).

I pass the following environment variables to the LocalAI runtime:

During the test I see just one process capping at 200% CPU in top:

So this means that only one request is executing at a given time. What should I do to run the tasks in parallel? Should I expect to see two ld.so processes, each one running on $LOCALAI_THREADS threads? Is there a way to monitor the running tasks?

It would be great to have your feedback, @mudler! Thanks in advance!
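As a rough way to exercise this, one could fire two requests at once and look at per-thread CPU usage while they are in flight. This is only a sketch: the port (8080), the model name, and the prompts are placeholders to adapt to your deployment:

```bash
#!/usr/bin/env bash
# Sketch: send two chat completions concurrently and watch CPU usage per thread.
# Port 8080 and the model name are assumptions -- adjust to your setup.
payload() {
  cat <<EOF
{"model": "openhermes-2.5-mistral-7b", "messages": [{"role": "user", "content": "Summarize: $1"}]}
EOF
}

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" -d "$(payload 'first text')" > /tmp/resp1.json &
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" -d "$(payload 'second text')" > /tmp/resp2.json &

# Give the backend a moment to start generating, then take a thread-level
# snapshot of CPU usage to see how many generations are actually running.
sleep 5
top -b -H -n 1 | head -n 40

wait   # block until both requests return
```

If the two generations really run in parallel, the thread view should show two groups of busy threads rather than a single process pinned at 200%.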
I have deployed the latest LocalAI container (v2.7.0) on 2 x Xeon 2680 v4 CPUs (56 threads in total) with 198 GB of RAM, but from what I can tell, the requests hitting the /v1/chat/completions endpoint are being processed one after the other, not in parallel.
I believe there are enough resources on my system to process these requests in parallel.
I am using the openhermes-2.5-mistral-7b.Q8_0.gguf model.
Getting a response for the curl example below takes 20+ seconds. Is this good or bad for the system I have?
Thanks
PS: I can't install a GPU on this system as it is a 1U unit.
Curl:
Docker-compose:
LocalAI logs: