
Commit e125dff

Try more offers when starting a job (#2387)
Increase the number of offers tried from 15 to 25, and make the limit configurable via `DSTACK_SERVER_MAX_OFFERS_TRIED`. The higher limit is needed in particular for RunPod Community Cloud spot offers: availability detection there is imprecise, so many offers may turn out to be unavailable. The increase can be seen as a temporary measure until availability detection for RunPod is improved.
1 parent 03ceb16 commit e125dff
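To build intuition for why trying more offers helps when availability detection is imprecise, consider a deliberately simple model. This is an assumption for illustration, not something the commit states: each offer is independently unavailable with the same probability `p`, so the chance that all `n` tried offers fail is `p ** n`. The helper `prob_all_fail` is hypothetical.

```python
# Illustrative model only: assumes each offer is independently unavailable
# with the same probability p. Real RunPod availability is not modeled here.


def prob_all_fail(p_unavailable: float, offers_tried: int) -> float:
    """Probability that every one of `offers_tried` offers turns out unavailable."""
    if not 0.0 <= p_unavailable <= 1.0:
        raise ValueError("p_unavailable must be in [0, 1]")
    if offers_tried < 0:
        raise ValueError("offers_tried must be non-negative")
    return p_unavailable ** offers_tried


if __name__ == "__main__":
    p = 0.9  # hypothetical: 90% of listed spot offers are actually unavailable
    for n in (15, 25):
        print(f"{n} offers tried: P(all fail) = {prob_all_fail(p, n):.3f}")
```

With the illustrative `p = 0.9`, raising the limit from 15 to 25 lowers the chance of exhausting every tried offer from about 0.206 to about 0.072. Real unavailability is unlikely to be independent across offers, so these numbers are only meant as intuition.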

3 files changed: +6 -1 lines

docs/docs/reference/environment-variables.md

Lines changed: 2 additions & 0 deletions
@@ -112,6 +112,8 @@ For more details on the options below, refer to the [server deployment](../guide
 
 * `DSTACK_SERVER_ROOT_LOG_LEVEL` – Sets root logger log level. Defaults to `ERROR`.
 * `DSTACK_SERVER_UVICORN_LOG_LEVEL` – Sets uvicorn logger log level. Defaults to `ERROR`.
+* `DSTACK_SERVER_MAX_OFFERS_TRIED` - Sets how many instance offers to try when starting a job.
+Setting a high value can degrade server performance.
 * `DSTACK_RUNNER_VERSION` – Sets exact runner version for debug. Defaults to `latest`. Ignored if `DSTACK_RUNNER_DOWNLOAD_URL` is set.
 * `DSTACK_RUNNER_DOWNLOAD_URL` – Overrides `dstack-runner` binary download URL.
 * `DSTACK_SHIM_DOWNLOAD_URL` – Overrides `dstack-shim` binary download URL.

src/dstack/_internal/server/background/tasks/process_submitted_jobs.py

Lines changed: 2 additions & 1 deletion
@@ -35,6 +35,7 @@
 )
 from dstack._internal.core.models.volumes import Volume
 from dstack._internal.core.services.profiles import get_termination
+from dstack._internal.server import settings
 from dstack._internal.server.db import get_db, get_session_ctx
 from dstack._internal.server.models import (
 FleetModel,
@@ -452,7 +453,7 @@ async def _run_job_on_new_instance(
 )
 # Limit number of offers tried to prevent long-running processing
 # in case all offers fail.
-for backend, offer in offers[:15]:
+for backend, offer in offers[: settings.MAX_OFFERS_TRIED]:
 logger.debug(
 "%s: trying %s in %s/%s for $%0.4f per hour",
 fmt(job_model),
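The changed loop bounds how many offers are attempted; the rest of the loop body is not shown in this hunk, but per the comment it presumably stops at the first offer that provisions successfully. A minimal sketch of that pattern follows, assuming a hypothetical `try_offer` callback and a simplified `Offer` dataclass. Neither is dstack's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Offer:
    # Hypothetical stand-in for dstack's offer type, reduced to two fields.
    instance_type: str
    price: float


def run_on_first_available(
    offers: Sequence[Tuple[str, Offer]],
    max_offers_tried: int,
    try_offer: Callable[[str, Offer], bool],
) -> Optional[Tuple[str, Offer]]:
    """Try (backend, offer) pairs in order and return the first one that succeeds.

    The slice caps the number of attempts at `max_offers_tried`; if fewer
    offers exist, slicing simply returns all of them without raising.
    """
    for backend, offer in offers[:max_offers_tried]:
        if try_offer(backend, offer):
            return backend, offer
    # Every tried offer failed (or the list was empty).
    return None


if __name__ == "__main__":
    # 40 hypothetical offers, of which only the one at index 20 is available.
    offers = [("runpod", Offer(f"gpu-{i}", float(i))) for i in range(40)]

    def only_gpu_20(backend: str, offer: Offer) -> bool:
        return offer.instance_type == "gpu-20"

    print(run_on_first_available(offers, 15, only_gpu_20))  # None: limit reached first
    print(run_on_first_available(offers, 25, only_gpu_20))
```

In this toy setup the available offer sits at position 21, so the old limit of 15 gives up before reaching it while the new default of 25 finds it. The cost is that, when every offer fails, up to 25 attempts are made instead of 15, which is why the docs warn that a high value can degrade server performance.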

src/dstack/_internal/server/settings.py

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,8 @@
 DB_POOL_SIZE = int(os.getenv("DSTACK_DB_POOL_SIZE", 10))
 DB_MAX_OVERFLOW = int(os.getenv("DSTACK_DB_MAX_OVERFLOW", 10))
 
+MAX_OFFERS_TRIED = int(os.getenv("DSTACK_SERVER_MAX_OFFERS_TRIED", 25))
+
 SERVER_CONFIG_DISABLED = os.getenv("DSTACK_SERVER_CONFIG_DISABLED") is not None
 SERVER_CONFIG_ENABLED = not SERVER_CONFIG_DISABLED
 
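`settings.py` parses the variable with a bare `int(os.getenv(...))`, so any integer is accepted. Because the value is used as a slice bound in `offers[: settings.MAX_OFFERS_TRIED]`, `0` would try no offers and a negative value would try all but the last few. The sketch below mirrors the parsing in a hypothetical `read_max_offers_tried` helper and adds a range check that the commit does not include; treat the check as an assumed hardening, not existing dstack behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_OFFERS_TRIED = 25  # same default as settings.py in this commit


def read_max_offers_tried(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse DSTACK_SERVER_MAX_OFFERS_TRIED the way settings.py does, plus validation.

    The `< 1` check is a hypothetical addition; the committed code only calls int().
    """
    env = os.environ if env is None else env
    raw = env.get("DSTACK_SERVER_MAX_OFFERS_TRIED")
    # int() raises ValueError for non-numeric strings, as in settings.py.
    value = DEFAULT_MAX_OFFERS_TRIED if raw is None else int(raw)
    if value < 1:
        # 0 would make offers[:value] empty; a negative value would mean
        # "all offers except the last abs(value)" rather than a cap.
        raise ValueError("DSTACK_SERVER_MAX_OFFERS_TRIED must be >= 1")
    return value


if __name__ == "__main__":
    print(read_max_offers_tried({}))  # 25
    print(read_max_offers_tried({"DSTACK_SERVER_MAX_OFFERS_TRIED": "40"}))  # 40
```

Since `settings.py` evaluates the expression once when the module is imported, a new value for the variable takes effect only after the server process is restarted.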