Releases: dstackai/dstack
0.19.20
User interface
Logs
This is a hotfix release addressing three major issues related to the UI:
- The UI didn’t display newer AWS CloudWatch logs if there was a long gap between old and new logs.
- Logs received before 0.19.19 appeared as base64-encoded in the UI. The UI now includes a button to decode them automatically.
- Logs were loaded from start to end, which made viewing very slow for long runs.
Note
The dstack logs CLI command may still be affected by the issues above. However, it’s less critical and will be addressed separately.
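Until the CLI is addressed, an encoded line from older logs can be decoded by hand. The sketch below assumes each log message is a standalone base64 string; the helper name decode_log_line and the sample value are illustrative and are not part of dstack.

```shell
# Hypothetical helper: decode one base64-encoded log message.
# Assumes the message is a standalone base64 string.
decode_log_line() {
  printf '%s' "$1" | base64 --decode
}

# Illustrative sample value, not real dstack output.
decode_log_line 'SGVsbG8sIGRzdGFjayE='
```

The sample value decodes to the text Hello, dstack!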
What's changed
- [chore]: Drop duplicate utility split_chunks by @jvstme in #2912
- [backends/CloudRift] Fixed issue with terminating inactive instance by @6erun in #2918
- Expose GPU metrics collected by runner as Prometheus metrics by @un-def in #2916
- [UI] Query logs using descending by @peterschmidt85 in #2915
- [UI] Fix logs loading #2892 by @olgenn in #2920
Full changelog: 0.19.19...0.19.20
0.19.19
Fleets
SSH fleets in-place updates
You can now add and remove instances in SSH fleets without recreating the entire fleet.
type: fleet
name: ssh-fleet
ssh_config:
  user: dstack
  identity_file: ~/.ssh/dstack
  hosts:
    - 10.0.0.1
    - 10.0.0.2
$ dstack apply -f fleet.dstack.yml
...
Fleet ssh-fleet does not exist yet.
Create the fleet? [y/n]: y
...
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
ssh-fleet 0 ssh (remote) cpu=4 mem=4GB disk=30GB $0 idle 09:08
1 ssh (remote) cpu=2 mem=4GB disk=30GB $0 idle 09:08
Then, if you update the hosts configuration property to
hosts:
  #- 10.0.0.1 # removed
  - 10.0.0.2
  - 10.0.0.3 # added
and apply the same configuration again, the fleet will be updated in-place. You don't need to stop runs on fleet instances that are not affected by the changes. In this example, you can still apply the configuration even if instance 1 is currently busy.
$ dstack apply -f fleet.dstack.yml
...
Found fleet ssh-fleet. Configuration changes detected.
Update the fleet in-place? [y/n]: y
...
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
ssh-fleet 1 ssh (remote) cpu=2 mem=4GB disk=30GB $0 idle 09:08
2 ssh (remote) cpu=8 mem=4GB disk=30GB $0 idle 09:12
Note
In-place updates only allow adding and/or removing instances. The root configuration and the configurations of unchanged hosts must stay the same; otherwise, a full fleet recreation is triggered, as before. This restriction may be lifted in the future.
Volumes
Automatic cleanup of unused volumes
The volume configuration gets a new auto_cleanup_duration property:
type: volume
name: my-volume
backend: aws
region: eu-west-1
availability_zone: eu-west-1a
auto_cleanup_duration: 1h
The volume will be automatically deleted after it has been unused for the specified duration.
Logs
Browsable, queryable, and searchable logs
dstack now stores run logs in plaintext; previously, they were base64-encoded. This allows you to use the configured log storage, be it AWS CloudWatch or GCP Logging, to browse and query dstack run logs.
Note
Logs generated before this release will be shown as base64-encoded in the UI and CLI after the update.
Server
Faster API response times
The dstack server API has been optimized to serialize JSON responses faster. The API endpoints are up to 2x faster than before.
Benchmarks
Benchmarking AMD GPUs: bare-metal, containers, partitions
Our new benchmark explores two important areas for optimizing AI workloads on AMD GPUs: First, do containers introduce a performance penalty for network-intensive tasks compared to a bare-metal setup? Second, how does partitioning a powerful GPU like the MI300X affect its real-world performance for different types of AI workloads?
What's Changed
- [Internal] Some runner tests fail on macOS by @peterschmidt85 in #2879
- Introduce job_submissions_limit for /api/runs/list by @r4victor in #2883
- Speed up json serialization with orjson and custom FastAPI responses by @r4victor in #2880
- [Docs]: Service rolling deployments by @jvstme in #2870
- Do not lose provisioning gateways on restart by @jvstme in #2887
- Add/remove SSH instances via in-place update by @un-def in #2884
- [Docs]: Add example of setting a PostgreSQL URL by @jvstme in #2888
- [Blog] Added new changelog by @peterschmidt85 in #2891
- Fix job_submissions_limit backward compatibility by @r4victor in #2894
- Fix run and job status_message calculation by @r4victor in #2889
- Fix 500 errors when requesting file logs by @r4victor in #2896
- Rolling deployments for port by @jvstme in #2893
- [Feature] Strip ANSI codes from run logs and store them as plain text instead of bytes by @peterschmidt85 in #2876
- [Feature]: Add ability to disable background processing and only run Web UI and API server #2901 by @james-boydell in #2902
- [shim] Don't check image downloaded size by @un-def in #2903
- Fix rolling deployment migration locking by @r4victor in #2904
- feat: add volume idle duration cleanup feature (#2497) by @haydnli-shopify in #2842
- [Blog] Benchmarking AMD GPUs: bare-metal, containers, partitions by @peterschmidt85 in #2905
- Fix /users/list by @r4victor in #2908
- Return logs in base64 for backward compatibility by @r4victor in #2910
Full Changelog: 0.19.18...0.19.19
0.19.18
Server
Optimized resources processing
This release includes major improvements that allow the dstack server to process more resources quickly. It also allows scaling the processing rate of a single server replica to take advantage of big Postgres instances by setting the DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR environment variable.
The result is:
- Faster processing rates: provisioning 100 runs on SQLite with default settings went from ~5m to ~2m.
- Better scaling: provisioning an additional 100 runs is even quicker due to a warm cache. Before, it was slower than the first 100 runs.
- Ability to process more runs per server replica: provisioning 300 runs on Postgres with DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=4 takes ~4m.
For more details on scaling background processing rates, see the Server deployment guide.
Backends
Private GCP gateways
It's now possible to create GCP gateways without public IPs:
type: gateway
name: example
domain: gateway.example.com
backend: gcp
region: europe-west9
public_ip: false
certificate: null
Note that configuring HTTPS certificates for private GCP gateways is not yet supported, so you need to specify certificate: null.
What's Changed
- Ignore SSH keys when calculating fleet conf diff by @un-def in #2869
- [Blog] Refactoring by @peterschmidt85 in #2873
- Implemented fronted precommit linting by @olgenn in #2868
- Support processing more resources per replica by @r4victor in #2871
- Use uvloop by default by @r4victor in #2874
- Add server profiling by @r4victor in #2875
- Fix NVIDIA container toolkit bug in all backends by @jvstme in #2877
- Private GCP gateways by @jvstme in #2881
- Switch to e2-medium for GCP gateways by @jvstme in #2886
Full Changelog: 0.19.17...0.19.18
0.19.17
Secrets
dstack gets support for secrets that allow centralized management of sensitive values such as API keys and credentials. They are project-scoped, managed by project admins, and can be referenced in run configurations to pass sensitive values to runs in a secure manner. Example:
$ dstack secret set my_secret some_secret_value
OK
type: task
nodes: 1
name: test-secrets
env:
  - MY_SECRET=${{ secrets.my_secret }}
commands:
  - echo $MY_SECRET
$ dstack apply -f .dstack/confs/task.dstack.yaml
Submit the run test-task? [y/n]: y
NAME BACKEND RESOURCES PRICE STATUS SUBMITTED
test-task aws cpu=2 mem=8GB $0.107 running 10:48
(eu-west-1) disk=100GB
test-secrets provisioning completed (running)
some_secret_value
Exited (0)
For more details on secrets, check out the docs.
Files
By default, dstack automatically mounts the repo directory where you ran dstack init to any run configuration.
However, in some cases, you may not want to mount the entire directory (e.g., if it’s too large), or you might want to mount files outside of it. In such cases, you can use the files property.
type: task
name: trl-sft
files:
  - .:examples # Maps the directory containing `.dstack.yml` to `/workflow/examples`
  - ~/.ssh/id_rsa # Maps `~/.ssh/id_rsa` to `/root/.ssh/id_rsa`
python: 3.12
env:
  - HF_TOKEN
  - HF_HUB_ENABLE_HF_TRANSFER=1
  - MODEL=Qwen/Qwen2.5-0.5B
  - DATASET=stanfordnlp/imdb
commands:
  - uv pip install trl
  - |
    trl sft \
      --model_name_or_path $MODEL --dataset_name $DATASET \
      --num_processes $DSTACK_GPUS_PER_NODE
resources:
  gpu: H100:1
Warning
If you have existing fleets, it's recommended to re-create them after upgrading to version 0.19.17. Otherwise, there is a risk that these instances won't be able to execute jobs if the run uses files.
Services
Rolling deployment
Rolling deployments introduced in 0.19.15 are now supported when deploying new commits or branches from a Git repo, or when changes are made to the repo contents or files listed in the files section.
Additionally, dstack apply now displays a full list of detected changes:
$ dstack apply -f my-service.dstack.yml
Active run my-service already exists. Detected changes that can be updated in-place:
- Repo state (branch, commit, or other)
- File archives
- Configuration properties:
  - env
  - files
Update the run? [y/n]:
Even when a rolling deployment isn't possible, the list of changes is still shown, making it easier to identify which changes are preventing the deployment from proceeding in-place.
What's changed
- [Bug]: Docker In Docker does not work with AMD by @peterschmidt85 in #2849
- [Feature] Add files property to run configurations by @un-def in #2848
- [Feature] Implement project secrets by @r4victor in #2854
- [Internal] Support fleet configurations for the local backend by @jvstme in #2856
- [Services] Rolling deployments for repo updates by @jvstme in #2853
- [Internal] Fix package dependency direction by @jvstme in #2859
- [Internal] Rolling deployments for files by @jvstme in #2862
- [Internal] Support the local backend with the in-server proxy by @jvstme in #2858
- [Docs] Added Files documentation by @peterschmidt85 in #2866
- [Bug] Fix ~ expansion in files by @un-def in #2865
- [Feature] Allow in-place update for more run properties by @jvstme in #2867
Full changelog: 0.19.16...0.19.17
0.19.16
Docker
Docker in Docker
Using Docker in a run configuration is now much easier. Just set docker to true:
type: task
name: docker-nvidia-smi
docker: true
commands:
  - docker run --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
resources:
  gpu: 1
This works with all run configuration types and supports both AMD and NVIDIA GPUs. It’s especially useful if you want to use the docker CLI in your commands, for example, to build Docker images.
The docker property is supported on all backends except vastai, runpod, and kubernetes, and is fully supported on SSH fleets as well.
Backends
CloudRift
The CloudRift team has added support for their GPU cloud, which can now be used with dstack.
To configure it, use a CloudRift API key in the backend configuration:
projects:
  - name: main
    backends:
      - type: cloudrift
        creds:
          type: api_key
          api_key: rift_2prgY1d0laOrf2BblTwx2B2d1zcf1zIp4tZYpj5j88qmNgz38pxNlpX3vAo
CloudRift offers competitive on-demand GPU pricing, with more GPUs and regions coming soon.
$ dstack apply -f examples/.dstack.yml -b cloudrift
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 cloudrift (us-east-nc-nr-1) cpu=16 mem=100GB disk=1000GB RTX5090:32GB:1 rtx59-16c-nr.1 $0.65
If you encounter any issues with this backend, please report them.
Server
Public projects
You can now create public projects that any user on the server can join or leave without approval. Previously, all projects were private, and adding new members required manual action by an admin or manager—a step that’s redundant in high-trust environments.
Admins can change a project’s visibility at any time in the project settings.
Metrics
The server exports new Prometheus metrics:
- dstack_submit_to_provision_duration_seconds: Time from run submission to first job provisioning
- dstack_pending_runs_total: Total number of pending runs
What's changed
- [Feature]: Property filter on Fleets, Models, Volumes pages by @olgenn in #2824
- [Bug]: Run/job status in UI/CLI is shown as provisioning instead of pulling by @peterschmidt85 in #2834
- [chore]: Fix annotation in update_service_desired_replica_count by @jvstme in #2840
- Add CloudRift backend by @6erun in #2771
- Fix Postgres deadlocks by @r4victor in #2843
- [UX] Simplify the use of Docker inside containers #2468 by @peterschmidt85 in #2828
- [Docs] Update docs and examples to reflect the docker property by @peterschmidt85 in #2831
- Add support for Tenstorrent n300 GPUs by @peterschmidt85 in #2827
- [Feature]: Property filter on Instances page by @olgenn in #2826
- [UI] Allow to hide the Tour panel by @olgenn in #2816
- Pr3 add join leave UI buttons by @haydnli-shopify in #2795
- Health metrics (Part 2) by @Nadine-H in #2796
- [Bug]: Use a unique token for log pagination instead of a timestamp by @peterschmidt85 in #2845
- Fix update project required permissions by @r4victor in #2846
Full changelog: 0.19.15...0.19.16
0.19.15
Services
Rolling deployments
This update introduces rolling deployments, which help avoid downtime when deploying new versions of your services.
When you apply an updated service configuration, dstack will gradually replace old service replicas with new ones. You can track the progress in the dstack apply output — the deployment number will be lower for old replicas and higher for new ones.
> dstack apply -f my-service.dstack.yml
Active run my-service already exists. Detected configuration changes that can be updated in-place: ['image', 'env', 'commands']
Update the run? [y/n]: y
⠋ Launching my-service...
NAME BACKEND RESOURCES PRICE STATUS SUBMITTED
my-service deployment=1 running 11 mins ago
replica=0 job=0 deployment=0 aws (us-west-2) cpu=2 mem=1GB disk=100GB (spot) $0.0026 terminating 11 mins ago
replica=1 job=0 deployment=1 aws (us-west-2) cpu=2 mem=1GB disk=100GB (spot) $0.0026 running 1 min ago
Currently, the following service configuration properties can be updated using rolling deployments: resources, volumes, image, user, privileged, entrypoint, python, nvcc, single_branch, env, shell, and commands.
Future releases will allow updating more properties and deploying new git repo commits.
Clusters
Updated default Docker images
If you don't specify a custom image in the run configuration, dstack uses its default images. These images have been improved for cluster environments and now include mpirun and NCCL tests. Additionally, if you are running on AWS EFA-capable instances, dstack will now automatically select an image with the appropriate EFA drivers. See our new AWS EFA guide for more details.
Server
Health metrics
The dstack server now exports operational Prometheus metrics that let you monitor its health. If you are running your own production-grade dstack server installation, refer to the metrics docs for details.
What's changed
- Set logsWaitDuration to 5m by @r4victor in #2794
- Add health metrics (Part 1) by @Nadine-H in #2760
- Add public projects by @haydnli-shopify in #2759
- Fix is_public allowing null by @r4victor in #2798
- Retry on VOLUME_ERROR and INSTANCE_UNREACHABLE by @jvstme in #2805
- Rework default Docker images by @peterschmidt85 in #2799
- Fix volume error status message by @jvstme in #2806
- [Docs] Added EFA example by @peterschmidt85 in #2820
- [Bug]: Empty spaces on User Details page by @olgenn in #2815
- Rolling deployment for services by @jvstme in #2821
- Fix building dstack package by @jvstme in #2823
New Contributors
- @haydnli-shopify made their first contribution in #2759
Full Changelog: 0.19.13...0.19.15
0.19.13
Clusters
Built-in InfiniBand support in dstack Docker images
The dstack default Docker images now come with built-in InfiniBand support, which includes the necessary libibverbs library and InfiniBand utilities from rdma-core. This means you can run torch distributed and other workloads utilizing NCCL, and they'll take full advantage of InfiniBand without custom Docker images.
You can try InfiniBand clusters with dstack on Nebius.
Built-in EFA support in dstack VM images
dstack now uses DLAMI as the default AWS GPU VM image instead of a custom one. DLAMI supports EFA out-of-the-box, so you no longer need to use a custom VM image to take advantage of EFA.
Server
GCS support for code uploads
It's now possible to configure the dstack server to use GCP Cloud Storage for code uploads. Previously, only DB and S3 storage were supported. Learn more in the Server deployment guide.
What's Changed
- Support file upload to gcs bucket by @colinjc in #2737
- Document File storage by @r4victor in #2755
- [Docs] Minor update of Clusters and Distributed tasks sections by @peterschmidt85 in #2741
- Fix CLI exiting while master starting by @r4victor in #2757
- [UI] Implement property filter on Run list page by @olgenn in #2762
- [Bug]: Text is unavailable for selection on run logs page by @olgenn in #2763
- Preinstall rdma-core packages into dstack Docker image by @r4victor in #2764
- [UX] Show status message as retrying in case a run or job is being retired by @peterschmidt85 in #2758
- [Docs] Minor improvements by @peterschmidt85 in #2766
- [Feature]: Include priority to the list of runs and sort runs by priority by @olgenn in #2768
- [Feature]: The Run details page should display the same fields as the Run list page by @olgenn in #2769
- [Feature]: Show Quickstart button if user don't have any runs by @olgenn in #2770
- [Feature]: Implement links for elements that have details page by @olgenn in #2772
- [Feature]: Add Refresh button on Run details page by @olgenn in #2773
- [Bug]: Tab Billing changes to Settings after top up balance by @olgenn in #2774
- Exclude backward incompatible fields from rest plugin calls by @colinjc in #2767
- [UI] Minor fixes by @peterschmidt85 in #2775
- Pin dkms by @r4victor in #2776
- Use DLAMI on AWS by @r4victor in #2782
- 2674 prop filter by @olgenn in #2778
- Fixed defect #2752 by @olgenn in #2784
- Update base image to 0.9 by @r4victor in #2786
- Fix status_message with missing on_events by @r4victor in #2788
- [Bug]: UI doesn't show Resources for instances of SSH fleets by @peterschmidt85 in #2785
- Ignore AWS quotas when hitting rate limits by @r4victor in #2791
Full Changelog: 0.19.12...0.19.13
0.19.12
Clusters
Simplified use of MPI
startup_order and stop_criteria
New run configuration properties are introduced:
- startup_order: any/master-first/workers-first specifies the order in which master and worker jobs are started.
- stop_criteria: all-done/master-done specifies when a multi-node run should be considered finished.
These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done and dstack won't wait for workers to exit.
DSTACK_MPI_HOSTFILE
dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
Below is the updated NCCL tests example.
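A minimal sketch of such a task, written from the properties described above rather than copied from the published example. The file name, the resources value, the DSTACK_NODE_RANK and DSTACK_GPUS_NUM variables, and the path to the NCCL test binary are assumptions to check against the docs.

```shell
# Write a hypothetical task configuration; file name and values are illustrative.
cat > nccl-tests.dstack.yml <<'EOF'
type: task
name: nccl-tests
nodes: 2
# Workers must be up before the master calls mpirun.
startup_order: workers-first
# The run is considered finished once the master exits.
stop_criteria: master-done
commands:
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      # Master node: launch the test on the nodes listed in the hostfile.
      mpirun --allow-run-as-root \
        --hostfile "$DSTACK_MPI_HOSTFILE" \
        -n "$DSTACK_GPUS_NUM" -N "$DSTACK_GPUS_PER_NODE" \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Worker nodes: stay alive until the master finishes.
      sleep infinity
    fi
resources:
  gpu: 8
EOF
```

The resulting file can then be submitted with dstack apply -f nccl-tests.dstack.yml.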
CLI
We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status that makes it easy to understand why a run or job was terminated.
Examples
Distributed training
TRL
The new TRL example walks you through how to run distributed fine-tuning using TRL, Accelerate, and DeepSpeed.
Axolotl
The new Axolotl example walks you through how to run distributed fine-tuning using Axolotl with dstack.
What's changed
- [Feature] Update .gitignore logic to catch more cases by @colinjc in #2695
- [Bug] Increase upload_code client timeout by @r4victor in #2709
- [Bug] Fix missing apt-get update by @r4victor in #2710
- [Internal]: Update git hooks and package.json by @olgenn in #2706
- [Examples] Add distributed Axolotl and TRL example by @Bihan in #2703
- [Docs] Update dstack-proxy contributing guide by @jvstme in #2683
- [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in #2718
- [Feature] Implement startup_order and stop_criteria by @r4victor in #2714
- [Bug] Fix CLI exiting while master starting by @r4victor in #2720
- [Examples] Simplify NCCL tests example by @r4victor in #2723
- [Examples] Update TRL Single Node example to uv by @Bihan in #2715
- [Bug] Fix backward compatibility when creating fleets by @jvstme in #2727
- [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in #2716
- [Bug] Fix relative paths in dstack apply --repo by @jvstme in #2733
- [Internal]: Drop hardcoded regions from the backend template by @jvstme in #2734
- [Internal]: Update backend template to match ruff formatting by @jvstme in #2735
Full changelog: 0.19.11...0.19.12
0.19.12rc1
What's Changed
Full Changelog: 0.19.11...0.19.12rc1
0.19.11
Runs
Replacing conda with uv
dstack's default Docker images now come with uv installed. Installing Python packages with uv can be significantly faster than with pip or conda. For example, here are uv vs pip times for installing torch on GCP VMs:
# time uv pip install torch
...
real 0m32.771s
user 0m29.070s
sys 0m8.300s
# time pip install torch
...
real 2m26.338s
user 1m37.514s
sys 0m16.711s
To continue supporting pip, dstack now automatically activates a virtual environment with pip available.
conda is no longer included in dstack's default Docker images. If you need to use conda, it should be installed manually:
commands:
- wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- bash miniconda.sh -b -p /workflow/miniconda
- eval "$(/workflow/miniconda/bin/conda shell.bash hook)"
Plugins
Built-in rest_plugin
dstack gets support for a built-in rest_plugin that allows writing custom plugins as API servers, so you don't need to install plugins as Python packages.
Plugins implemented as API servers have advantages over plugins implemented as Python packages in some cases:
- No dependency conflicts with dstack.
- You can use any programming language.
- If you run the dstack server via Docker, you don't need to extend the dstack server image with plugins or map them via volumes.
To get started, check out the plugin server example. The rest_plugin server API is documented here.
AWS
New CPU series
dstack now supports the most recent AWS CPU VMs based on Intel Xeon Sapphire Rapids: M7i, C7i, and R7i. It also adds support for the burstable T3 family. Previously, only M5, C5, and t2.small CPU instances were supported.
Azure
New CPU series
dstack now supports the most recent Azure CPU VMs based on Intel Xeon Sapphire Rapids: the general-purpose Dsv6 and memory-optimized Esv6 series. Previously, only the Dsv3, Esv4, and Fsv2 series were supported.
GCP
New CPU series
dstack now supports the most recent GCP CPU VMs: C4, M4, H3, N4, N2. Previously, only E2 and M1 were supported.
Note that C4, M4, H3, N4 instances do not currently support Volumes since they require Hyperdisk support.
Examples
Ray+RAGEN
The new Ray+RAGEN example shows how to use dstack and RAGEN to fine-tune an agent on multiple nodes.
Breaking changes
- conda is no longer included in dstack's default Docker images.
Deprecations
- Azure VM series Dsv3 and Esv4 are deprecated.
What's Changed
- [Examples] Ray+RAGEN by @Bihan in #2665
- [UX] Minor improvements of dstack metrics by @peterschmidt85 in #2667
- Fix request filtering for service stats by @jvstme in #2678
- Auto activate uv venv with pip installed by @r4victor in #2666
- Support new Azure CPU series by @r4victor in #2668
- [Blog] Case study: how EA uses dstack to fast-track AI development by @peterschmidt85 in #2682
- Add REST plugin for user-defined policies by @Nadine-H in #2631
- [UI] Minor update of help messages by @peterschmidt85 in #2690
- Fix wrong env var name in error message by @colinjc in #2686
- Fix upload_code limit message by @r4victor in #2691
- Support new GCP CPU series by @r4victor in #2685
- Drop humanize by @r4victor in #2692
- Support new AWS CPU series by @r4victor in #2693
- Disable max code upload limit in runner by @colinjc in #2694
- Generate REST plugin API docs by @Nadine-H in #2696
- Fix docs-build by @r4victor in #2700
- [UX]: Only show update notices for stable releases #2697 by @peterschmidt85 in #2698
- Run plugins in executor by @r4victor in #2701
- Fix phantom priority changes detected by @r4victor in #2702
- Update GRID drivers in Azure VM image by @jvstme in #2704
Full Changelog: 0.19.10...0.19.11