Releases: dstackai/dstack
0.18.9
Base Docker image with nvcc
If you don't specify a custom Docker image, dstack uses its own base image with essential CUDA drivers, python, pip, and conda (Miniforge). Previously, this image didn't include nvcc, needed for compiling custom CUDA kernels (e.g., Flash Attention).
With version 0.18.9, you can now include nvcc.
type: task
python: "3.10"
# This line ensures `nvcc` is included into the base Docker image
nvcc: true
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB
Environment variables for on-prem fleets
When you create an on-prem fleet, it's now possible to pre-configure environment variables. These variables will be used when installing the dstack-shim service on hosts and running workloads.
For example, these environment variables can be used to configure dstack to use a proxy:
type: fleet
name: my-fleet
placement: cluster
env:
  - HTTP_PROXY=http://proxy.example.com:80
  - HTTPS_PROXY=http://proxy.example.com:80
  - NO_PROXY=localhost,127.0.0.1
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52
Examples
New examples include:
Other
- [Bugfix] Fix filtering offers by disk size by @jvstme in #1517
- [Bugfix] Run containers as root for all images by @r4victor in #1499
- [Docs] Document GCP permissions for volumes by @r4victor in #1501
- [Docs] Another batch of docs improvements #1497 by @peterschmidt85 in #1498
- [Bugfix] Fix creating TensorDock instances by @jvstme in #1506
- [Bugfix] Launch TensorDock instances with correct disk size by @jvstme in #1508
- [Bugfix] Set timeouts to TensorDock API requests by @jvstme in #1509
- [Docs] Update TensorDock setup instructions by @jvstme in #1512
- [Internal] Implement API endpoint for listing volumes across projects by @r4victor in #1519
- [Internal] Include Volume.deleted in the API by @r4victor in #1520
- [Docs] Update the Axolotl example #1493 by @peterschmidt85 in #1494
- [Internal] Print docker image pulling errors to shim.log by @jvstme in #1503
- [Feature] Add env setting to fleet config for on-prem fleets by @un-def in #1505
Full changelog: 0.18.8...0.18.9
0.18.8
GCP volumes
#1477 added support for gcp volumes:
type: volume
name: my-gcp-volume
backend: gcp
region: europe-west1
size: 100GB
Previously, volumes were only supported for aws and runpod.
Major bugfixes
#1486 fixed a major bug introduced in 0.18.7 that could lead to instances not being terminated in the cloud.
Other
- Update Alignment Handbook example by @peterschmidt85 in #1475
- Add automatic generation of examples documentation by @peterschmidt85 in #1485
- Start dstack-shim service after network-online by @un-def in #1480
- Remove host_info.json on instance deploy if exists by @un-def in #1479
- Fix broken user token rotation API by @r4victor in #1487
New Contributors
Full Changelog: 0.18.7...0.18.8
0.18.7
Fleets
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
Cloud fleets
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
  gpu: 24GB
On-prem fleets
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
Deprecating dstack run
Now that we support dstack apply for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using dstack run WORKING_DIR -f CONFIG_FILE, you can now use dstack apply -f CONFIG_FILE.
Also, it's now possible to specify a name for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
  gpu: 80GB
This name is used as a run name and is more convenient than a random name. However, if you don't specify a name, dstack will assign a random name as before.
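For scripts that still assemble dstack run command lines, the migration described above can be sketched in Python. The helper name to_apply_argv is hypothetical, and it only covers the simple dstack run WORKING_DIR -f CONFIG_FILE form quoted above; commands with any other flags are rejected rather than guessed at.

```python
import shlex
from typing import List


def to_apply_argv(old_command: str) -> List[str]:
    """Rewrite `dstack run WORKING_DIR -f CONFIG_FILE` as `dstack apply -f CONFIG_FILE`.

    Hypothetical helper: only the simple form from the release notes is
    handled; anything else raises ValueError instead of guessing.
    """
    argv = shlex.split(old_command)
    if len(argv) != 5 or argv[:2] != ["dstack", "run"] or argv[3] != "-f":
        raise ValueError(f"unsupported command: {old_command!r}")
    config_file = argv[4]
    # The working directory (argv[2]) is dropped, matching the new form.
    return ["dstack", "apply", "-f", config_file]


if __name__ == "__main__":
    print(shlex.join(to_apply_argv("dstack run . -f .dstack.yml")))
```

Returning an argument list rather than a string keeps the result safe to pass to subprocess.run without shell quoting concerns.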
RunPod Volumes
In other news, we've added support for volumes in the runpod backend. Previously, they were only supported in the aws backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of RunPod volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
Major bugfixes
Important
This update fixes the kubernetes backend, which had been broken for the past few updates.
Other
- [UX] Make --gpu override YAML's gpu by @r4victor in #1455, #1431
- [Performance] Speed up listing runs for Python API and CLI by @r4victor in #1430
- [Performance] Speed up project loading by @r4victor in #1425
- [Bugfix] Remove busy offers from the top of offers list by @jvstme in #1452
- [Bugfix] Prioritize cheaper offers from the pool by @jvstme in #1453
- [Bugfix] Fix spot offers suggested for on-demand dev envs by @jvstme in #1450
- [Feature] Implement dstack volume delete by @r4victor in #1434
- [UX] Instances were always shown as provisioning for container backends by @r4victor
- [Docs] Fix typos by @jvstme in #1426
- [Docs] Fix a bad link by @tamanobi in #1422
- [Internal] Add DSTACK_SENTRY_PROFILES_SAMPLE_RATE by @r4victor in #1428
- [Internal] Update ruff to 0.5.3 by @jvstme in #1421
- [Internal] Update GitHub Actions dependencies by @jvstme in #1436
- [Bugfix] Respect regions for runpod by @r4victor in #1460
New contributors
Full changelog: 0.18.6...0.18.7
0.18.7rc2
This is a preview build of the upcoming 0.18.7 update, bringing a few major new features and many bug fixes.
Fleets
Important
With fleets, you can now describe clusters declaratively and create them in both cloud and on-prem with a single command. Once a fleet is created, it can be used with dev environments, tasks, and services.
Cloud fleets
To provision a fleet in the cloud, specify the required resources, number of nodes, and other optional parameters.
type: fleet
name: my-fleet
placement: cluster
nodes: 2
resources:
  gpu: 24GB
On-prem fleets
To create a fleet from on-prem servers, specify their hosts along with the user, port, and SSH key for connection via SSH.
type: fleet
name: my-fleet
placement: cluster
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52
To create or update the fleet, simply call the dstack apply command:
dstack apply -f examples/fleets/my-fleet.dstack.yml
Learn more about fleets in the documentation.
Deprecating dstack run
Important
Now that we support dstack apply for gateways, volumes, and fleets, we have extended this support to dev environments, tasks, and services. Instead of using dstack run WORKING_DIR -f CONFIG_FILE, you can now use dstack apply -f CONFIG_FILE.
Also, it's now possible to specify a name for dev environments, tasks, and services, just like for gateways, volumes, and fleets.
type: dev-environment
name: my-ide
python: "3.11"
ide: vscode
resources:
  gpu: 80GB
This name is used as a run name and is more convenient than a random name. However, if you don't specify a name, dstack will assign a random name as before.
RunPod Volumes
Important
In other news, we've added support for volumes in the runpod backend. Previously, they were only supported in the aws backend.
type: volume
name: my-new-volume
backend: runpod
region: ca-mtl-3
size: 100GB
A great feature of RunPod volumes is their ability to attach to multiple instances simultaneously. This allows for persisting cache across multiple service replicas or supporting distributed training tasks.
Major bugfixes
Important
This update fixes the kubernetes backend, which had been broken for the past few updates.
Other
- [UX] Make --gpu override YAML's gpu by @r4victor in #1455, #1431
- [Performance] Speed up listing runs for Python API and CLI by @r4victor in #1430
- [Performance] Speed up project loading by @r4victor in #1425
- [Bugfix] Remove busy offers from the top of offers list by @jvstme in #1452
- [Bugfix] Prioritize cheaper offers from the pool by @jvstme in #1453
- [Bugfix] Fix spot offers suggested for on-demand dev envs by @jvstme in #1450
- [Feature] Implement dstack volume delete by @r4victor in #1434
- [UX] Instances were always shown as provisioning for container backends by @r4victor
- [Docs] Fix typos by @jvstme in #1426
- [Docs] Fix a bad link by @tamanobi in #1422
- [Internal] Add DSTACK_SENTRY_PROFILES_SAMPLE_RATE by @r4victor in #1428
- [Internal] Update ruff to 0.5.3 by @jvstme in #1421
- [Internal] Update GitHub Actions dependencies by @jvstme in #1436
New contributors
Full changelog: 0.18.6...0.18.7rc2
0.18.6
Major fixes
- Support for GitLab's authorization when the repo is using HTTP/HTTPS by @jvstme in #1412
- Add a multi-node example to the Hugging Alignment Handbook example by @deep-diver in #1409
- Fix the issue where idle instances weren't offered (occurred when a GPU name was in upper case) by @jvstme in #1417
- Fix the issue where an exception is thrown for non-standard Git repo host URLs using HTTP/HTTPS by @jvstme in #1410
- Support H100 with the gcp backend by @jvstme in #1405
Warning
If you have idle instances in your pool, it is recommended to re-create them after upgrading to version 0.18.6. Otherwise, there is a risk that these instances won't be able to execute jobs.
Other
- [Internal] Add script for checking OCI images by @jvstme in #1408
- Fix repos migration on PostgreSQL by @jvstme in #1414
- [Internal] Fix dstack-runner repo tests by @jvstme in #1418
- Fix OCI listing not found errors by @jvstme in #1407
Full changelog: 0.18.5...0.18.6
0.18.5
Read below about its new features and bug-fixes.
Volumes
When you run anything with dstack, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With 0.18.5, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g. named my-new-volume), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
  - name: my-new-volume
    path: /volume_data
The data stored in the volume will persist across runs.
dstack allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
Important
Volumes are currently experimental and only work with the aws backend. Support for other backends is coming soon.
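The changelog for this release also notes that mounting volumes inside /workflow is disallowed (#1388). A minimal sketch of such a check follows, assuming the rule is a plain path-prefix test on an absolute path. The function name and the exact rule are illustrative and not taken from dstack's code.

```python
from pathlib import PurePosixPath

# Directory named in the changelog entry for #1388.
RESERVED_ROOT = PurePosixPath("/workflow")


def is_allowed_mount_path(path: str) -> bool:
    """Return True if `path` looks like a valid place to mount a volume.

    Assumptions of this sketch: mount paths must be absolute, and neither
    /workflow itself nor anything below it may be used. Paths containing
    ".." segments are not normalized here.
    """
    candidate = PurePosixPath(path)
    if not candidate.is_absolute():
        return False
    return candidate != RESERVED_ROOT and RESERVED_ROOT not in candidate.parents
```

With this rule, the /volume_data path from the example above is accepted, while /workflow/data is rejected.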
PostgreSQL
By default, dstack stores its state in ~/.dstack/server/data using SQLite. With this update, it's now possible to configure dstack to store its state in PostgreSQL. Just pass the DSTACK_DATABASE_URL environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
Important
Despite PostgreSQL support, dstack still requires that you run only one instance of the dstack server. However, this requirement will be lifted in a future update.
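When the database password contains characters such as @ or /, the connection URL needs percent-encoding. The following is a small sketch of assembling the value for DSTACK_DATABASE_URL; the helper name build_database_url is hypothetical, and the URL layout simply mirrors the example above.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a postgresql+asyncpg URL in the same shape as the example above."""
    # Percent-encode the credentials so reserved characters cannot break the URL.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"postgresql+asyncpg://{credentials}@{host}:{port}/{database}"


if __name__ == "__main__":
    print(build_database_url("myuser", "mypassword", "localhost", 5432, "mydatabase"))
```

The resulting string can then be exported as DSTACK_DATABASE_URL before starting dstack server.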
On-prem clusters
Previously, dstack didn't allow the use of on-prem clusters (added via dstack pool add-ssh) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
Supported GPUs
Previously, dstack didn't support L4 and H100 GPUs with AWS. Now you can use them.
Full changelog
- Support dstack volumes by @r4victor in #1364
- Filter pool instances with respect to volumes availability zone by @r4victor in #1368
- Support AWS L4 GPU by @jvstme in #1365
- Add Concepts->Volumes by @r4victor in #1370
- Improve Overview page by @r4victor in #1377
- Add volumes prices by @r4victor in #1382
- Wait for GCP VM no capacity error by @r4victor in #1387
- Disallow mounting volumes inside /workflow by @r4victor in #1388
- Support NVIDIA NVSwitch in dstack VM images by @jvstme in #1389
- Optimize loading dstack Docker images by @jvstme in #1391
- Improve Contributing by @r4victor in #1392
- Support running dstack server with Postgres by @r4victor in #1398
- Support H100 GPU on AWS by @jvstme in #1394
- Fix possible server freeze after pool add-ssh by @jvstme in #1396
- Add OCI eu-milan-1 region by @jvstme in #1400
- Prepare future OCI spot instances support by @jvstme in #1401
- Remove if backends configured check by @r4victor in #1404
- Include project_name in Instance and Volume by @r4victor in #1390
See more: 0.18.4...0.18.5
0.18.5rc1
This is a release candidate build of the upcoming 0.18.5 release. Read below to learn about its new features and bug-fixes.
Volumes
When you run anything with dstack, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With 0.18.5, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g. named my-new-volume), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
  - name: my-new-volume
    path: /volume_data
The data stored in the volume will persist across runs.
dstack allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
Important
Volumes are currently experimental and only work with the aws backend. Support for other backends is coming soon.
PostgreSQL
By default, dstack stores its state in /root/.dstack/server/data using SQLite. With this update, it's now possible to configure dstack to store its state in PostgreSQL. Just pass the DSTACK_DATABASE_URL environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
Important
Despite PostgreSQL support, dstack still requires that you run only one instance of the dstack server. However, this requirement will be lifted in a future update.
On-prem clusters
Previously, dstack didn't allow the use of on-prem clusters (added via dstack pool add-ssh) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
Supported GPUs
Previously, dstack didn't support L4 and H100 GPUs with AWS. Now you can use them.
Full changelog
- Support dstack volumes by @r4victor in #1364
- Filter pool instances with respect to volumes availability zone by @r4victor in #1368
- Support AWS L4 GPU by @jvstme in #1365
- Add Concepts->Volumes by @r4victor in #1370
- Improve Overview page by @r4victor in #1377
- Add volumes prices by @r4victor in #1382
- Wait for GCP VM no capacity error by @r4victor in #1387
- Disallow mounting volumes inside /workflow by @r4victor in #1388
- Support NVIDIA NVSwitch in dstack VM images by @jvstme in #1389
- Optimize loading dstack Docker images by @jvstme in #1391
- Improve Contributing by @r4victor in #1392
- Support running dstack server with Postgres by @r4victor in #1398
- Support H100 GPU on AWS by @jvstme in #1394
- Fix possible server freeze after pool add-ssh by @jvstme in #1396
- Add OCI eu-milan-1 region by @jvstme in #1400
- Prepare future OCI spot instances support by @jvstme in #1401
- Remove if backends configured check by @r4victor in #1404
- Include project_name in Instance and Volume by @r4victor in #1390
See more: 0.18.4...0.18.5rc1
0.18.4
Google Cloud TPU
This update introduces initial support for Google Cloud TPU.
To request a TPU, specify the TPU architecture prefixed by tpu- (in gpu under resources):
type: task
python: "3.11"
commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
  gpu: tpu-v2-8
Important
Currently, only TPUs with 8 cores can be requested, which means only single-device TPU workloads are supported. Support for multiple TPU devices is coming soon.
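The naming convention and the 8-core restriction can be illustrated with a short parser. This is a sketch only: it assumes the value always has the shape tpu-<version>-<cores>, as in tpu-v2-8, and the names TpuSpec, parse_tpu, and is_supported are hypothetical rather than part of dstack.

```python
import re
from typing import NamedTuple


class TpuSpec(NamedTuple):
    version: str
    cores: int


# Assumed shape: "tpu-" + architecture version + "-" + core count, e.g. "tpu-v2-8".
_TPU_PATTERN = re.compile(r"^tpu-(?P<version>[a-z0-9]+)-(?P<cores>\d+)$")


def parse_tpu(gpu_value: str) -> TpuSpec:
    """Split a `gpu` value such as "tpu-v2-8" into version and core count."""
    match = _TPU_PATTERN.match(gpu_value)
    if match is None:
        raise ValueError(f"not a TPU spec: {gpu_value!r}")
    return TpuSpec(match["version"], int(match["cores"]))


def is_supported(spec: TpuSpec) -> bool:
    """Mirror the restriction stated above: only 8-core TPUs for now."""
    return spec.cores == 8
```

For example, parse_tpu("tpu-v2-8") yields version "v2" with 8 cores, which passes is_supported, while a 32-core spec does not.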
Private subnets with GCP
Additionally, the update allows configuring the gcp backend to use only private subnets. To achieve this, set public_ips to false.
projects:
- name: main
  backends:
    - type: gcp
      creds:
        type: default
      public_ips: false
Major bug-fixes
Besides TPU, the update fixes a few important bugs.
- Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
- Fix nvidia-smi not available on lambda by @r4victor in #1357
- Respect registry_auth for RunPod by @smokfyz in #1333
- Support multi-node tasks on oci by @jvstme in #1334
Other
- Show warning on required ssh version by @loghijiaha in #1313
- Add OCI packer templates by @jvstme in #1316
- Support oci Bare Metal instances by @jvstme in #1325
- Support oci BM.Optimized3.36 instance by @jvstme in #1328
- [Docs] Update dstack pool docs by @jvstme in #1329
- Add TPU support in gcp by @Bihan in #1323
- Fix failing runner-test workflow by @r4victor in #1336
- Document OCI permissions by @jvstme in #1338
- Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
- Update serve.dstack.yml - infinity by @michaelfeil in #1340
- Support instances without public IP for GCP by @smokfyz in #1341
- [Internal] Automate OCI images publishing by @jvstme in #1346
- Fix slow /api/pools/list_instances by @r4victor in #1320
- Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
- [Internal] Fix linter errors by @jvstme in #1322
- TPU support enhancements by @r4victor in #1330
- TPU initial release by @Bihan in #1354
- TPUs fixes by @r4victor in #1360
- Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
- Even more flexible OCI client credentials by @jvstme in #1317
New contributors
- @loghijiaha made their first contribution in #1313
- @smokfyz made their first contribution in #1333
- @michaelfeil made their first contribution in #1340
Full changelog: 0.18.3...0.18.4
0.18.4rc3
This is a preview build of the upcoming 0.18.4 release. See below for what's new.
TPU
One of the major new features in this update is the initial support for Google Cloud TPU.
To request a TPU, you simply need to specify the system architecture of the required TPU prefixed by tpu- in gpu:
type: task
python: "3.11"
commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
  gpu: tpu-v2-8
Important
Tasks cannot yet request multiple nodes (to run in parallel on multiple TPU devices). This feature is coming soon.
You're very welcome to try the initial support and share your feedback.
Major bug-fixes
Besides TPU, the update fixes a few important bugs.
- Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
- Fix nvidia-smi not available on lambda by @r4victor in #1357
- Respect registry_auth for RunPod by @smokfyz in #1333
- Support multi-node tasks on oci by @jvstme in #1334
Other
- Show warning on required ssh version by @loghijiaha in #1313
- Add OCI packer templates by @jvstme in #1316
- Support oci Bare Metal instances by @jvstme in #1325
- Support oci BM.Optimized3.36 instance by @jvstme in #1328
- [Docs] Update dstack pool docs by @jvstme in #1329
- Add TPU support in gcp by @Bihan in #1323
- Fix failing runner-test workflow by @r4victor in #1336
- Document OCI permissions by @jvstme in #1338
- Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
- Update serve.dstack.yml - infinity by @michaelfeil in #1340
- Support instances without public IP for GCP by @smokfyz in #1341
- [Internal] Automate OCI images publishing by @jvstme in #1346
- Fix slow /api/pools/list_instances by @r4victor in #1320
- Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
- [Internal] Fix linter errors by @jvstme in #1322
- TPU support enhancements by @r4victor in #1330
- TPU initial release by @Bihan in #1354
- TPUs fixes by @r4victor in #1360
- Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
- Even more flexible OCI client credentials by @jvstme in #1317
New contributors
- @loghijiaha made their first contribution in #1313
- @smokfyz made their first contribution in #1333
- @michaelfeil made their first contribution in #1340
Full changelog: 0.18.3...0.18.4rc3
0.18.3
Oracle Cloud Infrastructure
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called oci and can be configured as follows:
projects:
- name: main
  backends:
    - type: oci
      creds:
        type: default
The supported credential types include default and client. In case default is used, dstack automatically picks the default OCI credentials from ~/.oci/config.
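To see what the default credentials contain, the OCI config file can be inspected with Python's standard configparser, since ~/.oci/config uses an INI layout with a [DEFAULT] profile. This is only an illustration for checking a local setup; how dstack itself loads the file is not described here, and the helper name and sample values below are made up.

```python
import configparser
from typing import Dict


def load_oci_profile(config_text: str, profile: str = "DEFAULT") -> Dict[str, str]:
    """Return the key/value pairs of one profile from OCI config file text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if profile != "DEFAULT" and not parser.has_section(profile):
        raise KeyError(f"profile not found: {profile}")
    return dict(parser[profile])


# Made-up sample contents; real files hold your own OCIDs and key path.
SAMPLE = """
[DEFAULT]
user=ocid1.user.oc1..example
fingerprint=aa:bb:cc
tenancy=ocid1.tenancy.oc1..example
region=eu-frankfurt-1
key_file=~/.oci/oci_api_key.pem
"""

if __name__ == "__main__":
    print(sorted(load_oci_profile(SAMPLE)))
```

To check a real setup, pass the text of ~/.oci/config (for example via pathlib.Path.home() / ".oci" / "config") instead of SAMPLE.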
Just like other backends, oci supports dev environments, tasks, and services.
Note
Support for spot instances, multi-node tasks, and gateways is coming soon.
Find more documentation on using Oracle Cloud Infrastructure on the reference page.
Retry policy
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
  - python train.py
retry:
  on_events: [no-capacity]
  duration: 2h
Now, if you run such a task, dstack will keep trying to find capacity for up to 2 hours. Once capacity is found, dstack will run the task.
The on_events property also supports error (in case the run fails with an error) and interruption (if the run is using a spot instance and it was interrupted).
Previously, dstack only allowed retries when spot instances were interrupted.
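The retry behavior described above can be modeled as a small loop. This is an illustrative sketch, not dstack's scheduler: the "done" outcome, the polling interval, and the function name run_with_retry are assumptions, while the event names no-capacity, error, and interruption come from the text.

```python
import time
from typing import Callable, Iterable


def run_with_retry(
    attempt: Callable[[], str],
    on_events: Iterable[str],
    duration_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    poll_s: float = 5.0,
) -> str:
    """Call attempt() until it succeeds, hits a non-retryable event, or time runs out.

    attempt() returns "done" on success, or an event name such as
    "no-capacity", "error", or "interruption".
    """
    retryable = set(on_events)
    deadline = clock() + duration_s
    while True:
        outcome = attempt()
        if outcome == "done" or outcome not in retryable:
            return outcome  # success, or an event the policy does not cover
        if clock() >= deadline:
            return outcome  # retry window exhausted
        sleep(poll_s)
```

With on_events set to ["no-capacity"] and duration_s set to 7200, this corresponds to the 2h example above; the clock and sleep parameters exist so the loop can be exercised without real waiting.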
RunPod
Previously, the runpod backend only allowed Docker images with /bin/bash or /bin/sh as the entrypoint. Thanks to a fix on RunPod's side, dstack now allows the use of any Docker image.
Additionally, the runpod backend now also supports spot instances.
GCP
The gcp backend now also allows configuring VPCs:
projects:
- name: main
  backends:
    - type: gcp
      project_id: my-awesome-project
      creds:
        type: default
      vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify vpc_project_id.
AWS
Last but not least, for the aws backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
- name: main
  backends:
    - type: aws
      creds:
        type: default
      vpc_ids:
        us-east-1: vpc-0a2b3c4d5e6f7g8h
      default_vpcs: true
You just need to set default_vpcs to true.
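How vpc_ids and default_vpcs combine can be sketched as a lookup per region. The function name resolve_vpc and the DEFAULT_VPC marker are hypothetical, and returning None for an unlisted region when default_vpcs is false is an assumption of this sketch rather than documented behavior.

```python
from typing import Mapping, Optional

# Placeholder meaning "use the region's default VPC"; not a real VPC ID.
DEFAULT_VPC = "<default VPC>"


def resolve_vpc(region: str, vpc_ids: Mapping[str, str], default_vpcs: bool) -> Optional[str]:
    """Pick the VPC for a region from a vpc_ids mapping and a default_vpcs flag."""
    if region in vpc_ids:
        return vpc_ids[region]  # explicitly configured region
    if default_vpcs:
        return DEFAULT_VPC  # fall back to the default VPC in other regions
    return None  # assumption: the region has no usable VPC


if __name__ == "__main__":
    config = {"us-east-1": "vpc-0a2b3c4d5e6f7g8h"}
    print(resolve_vpc("us-east-1", config, True))
    print(resolve_vpc("eu-west-1", config, True))
```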
Other changes
- Fix reverse server-gateway ssh tunnel by @r4victor in #1303
- Respect run filters for the ssh backend by @r4victor in #1278
- Support resubmitted runs in dstack run attached mode by @r4victor in #1285
- Do not run jobs on unreachable instances by @r4victor in #1286
- Show job termination reason in dstack ps -v by @r4victor in #1301
- Rename dstack destroy to dstack delete by @r4victor in #1275
- Prepare OCI backend for release by @jvstme in #1308
- [Docs] Improve the documentation of the Pydantic models #1293 by @peterschmidt85 in #1295
- [Docs] Fix Authorization header by @jvstme in #1305
