The Oshepherd guiding the Ollama(s) inference orchestration.
A centralized FastAPI service, using Celery and Redis to orchestrate multiple Ollama servers as workers.
```
pip install oshepherd
```

- Setup Redis:
Celery uses Redis as the message broker and result backend. You'll need a Redis instance, which you can provision for free at redislabs.com.
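For example, you can sanity-check the credentials before wiring them into the env files below. A minimal sketch using the redis Python package, with a placeholder connection URL:

```python
# Quick connectivity check against the Redis instance used as broker/backend.
# The URL is a placeholder; substitute your own Redis credentials.
import redis

REDIS_URL = "redis://default:<password>@<host>:<port>/0"

r = redis.from_url(REDIS_URL)
print(r.ping())  # True if the host, port, and credentials are valid
```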
- Setup FastAPI Server:
```bash
# define configuration env file
# use credentials for redis as broker and backend
cp .api.env.template .api.env

# start api
oshepherd start-api --env-file .api.env
```
- Setup Celery/Ollama Worker(s):
```bash
# install ollama https://ollama.com/download
# optionally pull the model
ollama pull mistral

# define configuration env file
# use credentials for redis as broker and backend
cp .worker.env.template .worker.env

# start worker
oshepherd start-worker --env-file .worker.env
```
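With the API server and at least one worker running, a quick smoke test is to list the models reachable through the orchestrator; a sketch assuming the default host and port used in the client examples below:

```python
# Smoke test: list models through the oshepherd API (GET /api/tags).
# Assumes the API server from the steps above is listening on 127.0.0.1:5001.
import ollama

client = ollama.Client(host="http://127.0.0.1:5001")
print(client.list())
```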
Now you're ready to execute Ollama completions remotely. You can point your Ollama client to your oshepherd API server by setting the host, and it will return your requested completions from any of the workers:

- ollama-python client:
```python
import ollama

client = ollama.Client(host="http://127.0.0.1:5001")

# Standard request
response = client.generate(model="mistral", prompt="Why is the sky blue?")

# Streaming request
for chunk in client.generate(model="mistral", prompt="Why is the sky blue?", stream=True):
    print(chunk['response'], end='', flush=True)
```
For a complete Python example with streaming support, see examples/pretty_streaming.py.
- ollama-js client:
```javascript
import { Ollama } from "ollama/browser";

const ollama = new Ollama({ host: "http://127.0.0.1:5001" });

// Standard request
const response = await ollama.generate({
  model: "mistral",
  prompt: "Why is the sky blue?",
});

// Streaming request
const streamResponse = await ollama.generate({
  model: "mistral",
  prompt: "Why is the sky blue?",
  stream: true,
});
for await (const chunk of streamResponse) {
  process.stdout.write(chunk.response);
}
```
For a complete TypeScript/JavaScript example with streaming support, see examples/ts-scripts/README.md.
- Raw HTTP request:
```bash
curl -X POST -H "Content-Type: application/json" -L http://127.0.0.1:5001/api/generate/ \
  -d '{"model":"mistral","prompt":"Why is the sky blue?","stream":true}' \
  --no-buffer
```
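The raw endpoint can also be consumed without the official clients. A sketch using the requests package, assuming the stream is relayed as newline-delimited JSON chunks in the same shape as the upstream Ollama API:

```python
# Stream /api/generate directly over HTTP, without an Ollama client.
# Assumes newline-delimited JSON chunks, as in the Ollama API.
import json
import requests

resp = requests.post(
    "http://127.0.0.1:5001/api/generate/",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```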
This package is in alpha; its architecture and API might change in the near future. It is currently being tested in a controlled environment by real users, but it has not been audited or thoroughly tested. Use it at your own risk.
As this is an alpha version, support and responses might be limited. We'll do our best to address questions and issues as quickly as possible.
- Generate a completion: POST /api/generate
- Generate a chat completion: POST /api/chat
- Generate Embeddings: POST /api/embeddings
- List Local Models: GET /api/tags
- Version: GET /api/version
- Show Model Information: POST /api/show
- List Running Models: GET /api/ps
Oshepherd API server currently supports the endpoints listed above, enabling full compatibility with official Ollama clients (i.e.: ollama-python, ollama-js). These endpoints provide comprehensive functionality for the most common use cases. Additional endpoints from the official Ollama API are not planned for the near future. For more details on the full Ollama API specifications, refer to the Ollama API documentation.
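For illustration, the endpoints not covered by the examples above can be exercised with the same ollama-python client pointed at the oshepherd host; a sketch assuming the default host and port from the examples above:

```python
# Exercise the remaining supported endpoints through the oshepherd API.
# Assumes the server is reachable at the host/port used in the examples above.
import ollama

client = ollama.Client(host="http://127.0.0.1:5001")

# POST /api/chat
chat = client.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(chat["message"]["content"])

# POST /api/embeddings
emb = client.embeddings(model="mistral", prompt="Why is the sky blue?")
print(len(emb["embedding"]))

# POST /api/show and GET /api/ps
print(client.show("mistral"))
print(client.ps())
```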
We welcome contributions! If you find a bug or have suggestions for improvements, please open an issue or submit a pull request targeting the development branch. Before creating a new issue or pull request, take a moment to search the existing ones to avoid duplicates.
To run and build locally, you can use conda:

```bash
conda create -n oshepherd python=3.12
conda activate oshepherd
pip install -r requirements.txt

# install oshepherd
pip install -e .
```

Follow the usage instructions to start the API server and a Celery worker using a local Ollama, and then run the tests:

```bash
pytest -s tests/
```

This is a project developed and maintained by mnemonica.ai.
