
Commit a0aa057

[Docs] Updated docs and examples to reflect the changes in 0.11.1 (part 2)

1 parent: 9b5227f

File tree: 9 files changed, +122 -82 lines

README.md

Lines changed: 7 additions & 9 deletions
@@ -9,7 +9,7 @@
 </h1>

 <h3 align="center">
-Train and deploy LLM models in multiple clouds
+Run LLM workloads across any clouds
 </h3>

 <p align="center">
@@ -23,18 +23,16 @@ Train and deploy LLM models in multiple clouds
 [![PyPI - License](https://img.shields.io/pypi/l/dstack?style=flat-square&color=blue)](https://github.com/dstackai/dstack/blob/master/LICENSE.md)
 </div>

-`dstack` is an open-source tool that enables the execution of LLM workloads
-across multiple cloud providers – ensuring the best GPU price and availability.
+`dstack` is an open-source toolkit for running LLM workloads across any clouds, offering a
+cost-efficient and user-friendly interface for training, inference, and development.

-Deploy services, run tasks, and provision dev environments
-in a cost-effective manner across multiple cloud GPU providers.
-
-## Latest news
+## Latest news ✨

 - [2023/08] [Fine-tuning with Llama 2](https://dstack.ai/examples/finetuning-llama-2) (Example)
 - [2023/08] [An early preview of services](https://dstack.ai/blog/2023/08/07/services-preview) (Release)
-- [2023/07] [Port mapping, max duration, and more](https://dstack.ai/blog/2023/07/25/port-mapping-max-duration-and-more) (Release)
-- [2023/07] [Serving with vLLM](https://dstack.ai/examples/vllm) (Example)
+- [2023/08] [Serving SDXL with FastAPI](https://dstack.ai/examples/stable-diffusion-xl) (Example)
+- [2023/07] [Serving LLMS with TGI](https://dstack.ai/examples/text-generation-inference) (Example)
+- [2023/07] [Serving LLMS with vLLM](https://dstack.ai/examples/vllm) (Example)

 ## Installation

docs/blog/posts/multiple-clouds.md

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ categories:
 - Releases
 ---

-# Discover GPU across multiple clouds
+# Automatic GPU discovery across clouds

 __The 0.11 update significantly cuts GPU costs and boosts their availability.__

@@ -16,7 +16,7 @@ configured cloud providers and regions.

 <!-- more -->

-## Multiple clouds per project
+## Multiple backends per project

 Now, `dstack` leverages price data from multiple configured cloud providers and regions to automatically suggest the
 most cost-effective options.

docs/examples/text-generation-inference.md

Lines changed: 45 additions & 8 deletions
@@ -31,13 +31,11 @@ Here's the configuration that uses services:

 ```yaml
 type: service
-# This configuration deploys a given LLM model as an API

 image: ghcr.io/huggingface/text-generation-inference:latest

 env:
-  # (Required) Specify the name of the model
-  - MODEL_ID=tiiuae/falcon-7b
+  - MODEL_ID=NousResearch/Llama-2-7b-hf

 port: 8000

@@ -84,11 +82,50 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com \

 </div>

-!!! info "Gated models"
-    To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
-    (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
-    or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
-    using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+### Gated models
+
+To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
+(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
+or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
+using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+
+<div class="termy">
+
+```shell
+$ dstack run . -f text-generation-inference/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=&lt;token&gt; --gpu 24GB
+```
+
+</div>
+
+### Memory usage and quantization
+
+An LLM typically requires twice the GPU memory compared to its parameter count. For instance, a model with `13B` parameters
+needs around `26GB` of GPU memory. To decrease memory usage and fit the model on a smaller GPU, consider using
+quantization, which TGI offers as `bitsandbytes` and `gptq` methods.
+
+Here's an example of the Llama 2 13B model tailored for a `24GB` GPU (A10 or L4):
+
+<div editor-title="text-generation-inference/serve.dstack.yml">
+
+```yaml
+type: service
+
+image: ghcr.io/huggingface/text-generation-inference:latest
+
+env:
+  - MODEL_ID=TheBloke/Llama-2-13B-GPTQ
+
+port: 8000
+
+commands:
+  - text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq
+```
+
+</div>
+
+A similar approach allows running the Llama 2 70B model on an `80GB` GPU (A100).
+
+To calculate the exact GPU memory required for a specific model with different quantization methods, you can use the
+[hf-accelerate/memory-model-usage](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) Space.

 ??? info "Dev environments"

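As a back-of-the-envelope check on the memory figures added above, here is a minimal Python sketch. The bytes-per-parameter values are rule-of-thumb assumptions (fp16 weights, 8-bit and 4-bit quantization) and ignore activations, the KV cache, and framework overhead, so treat the results as rough lower bounds rather than exact requirements.

```python
# Rough GPU memory needed just to hold model weights, per quantization scheme.
# These are rule-of-thumb figures, not measurements.
BYTES_PER_PARAM = {
    "fp16": 2.0,       # half-precision weights (the "2x parameter count" rule)
    "int8": 1.0,       # e.g. bitsandbytes 8-bit
    "gptq-4bit": 0.5,  # e.g. GPTQ 4-bit
}


def weight_memory_gb(params_in_billions: float, scheme: str) -> float:
    """Approximate memory (GB) required to load the weights alone.

    Billions of parameters times bytes per parameter gives GB directly,
    since the factor of 1e9 cancels.
    """
    return params_in_billions * BYTES_PER_PARAM[scheme]


for name, size_b in [("Llama-2-7B", 7), ("Llama-2-13B", 13), ("Llama-2-70B", 70)]:
    fp16 = weight_memory_gb(size_b, "fp16")
    gptq = weight_memory_gb(size_b, "gptq-4bit")
    print(f"{name}: ~{fp16:.1f} GB at fp16, ~{gptq:.1f} GB with 4-bit GPTQ")

# Approximate output:
#   Llama-2-7B: ~14.0 GB at fp16, ~3.5 GB with 4-bit GPTQ
#   Llama-2-13B: ~26.0 GB at fp16, ~6.5 GB with 4-bit GPTQ
#   Llama-2-70B: ~140.0 GB at fp16, ~35.0 GB with 4-bit GPTQ
```

The ~6.5 GB estimate for a 4-bit 13B model is consistent with the example above, which targets a `24GB` A10 or L4; the remaining headroom goes to activations and the KV cache.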
docs/examples/vllm.md

Lines changed: 15 additions & 9 deletions
@@ -31,12 +31,10 @@ Here's the configuration that uses services to run an LLM as an OpenAI-compatibl
 ```yaml
 type: service

-# (Optional) If not specified, it will use your local version
 python: "3.11"

 env:
-  # (Required) Specify the name of the model
-  - MODEL=facebook/opt-125m
+  - MODEL=NousResearch/Llama-2-7b-hf

 port: 8000

@@ -75,7 +73,7 @@ Once the service is up, you can query the endpoint:
 $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
-        "model": "facebook/opt-125m",
+        "model": "NousResearch/Llama-2-7b-hf",
         "prompt": "San Francisco is a",
         "max_tokens": 7,
         "temperature": 0
@@ -84,10 +82,18 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \

 </div>

-!!! info "Gated models"
-    To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
-    (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
-    or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
-    using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+### Gated models
+
+To use a gated-access model from Hugging Face Hub, make sure to set up either the `HUGGING_FACE_HUB_TOKEN` secret
+(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
+or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
+using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+
+<div class="termy">
+
+```shell
+$ dstack run . -f vllm/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=&lt;token&gt; --gpu 24GB
+```
+
+</div>

[Source code](https://github.com/dstackai/dstack-examples){ .md-button .md-button--github }
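Because the endpoint above is OpenAI-compatible, the same completion request can also be sent from Python. A minimal sketch, assuming the `openai` package (pre-1.0 interface) and the example service hostname used in the docs above:

```python
import openai

# Point the client at the dstack service endpoint instead of api.openai.com.
# Replace the hostname with the URL printed by `dstack run`.
openai.api_base = "https://yellow-cat-1.mydomain.com/v1"
openai.api_key = "EMPTY"  # the vLLM server does not validate the key

completion = openai.Completion.create(
    model="NousResearch/Llama-2-7b-hf",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```

The request mirrors the `curl` call shown above; only the client differs.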

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 template: home.html
-title: Train and deploy LLM models in multiple clouds
+title: Run LLM workloads across any clouds
 hide:
   - navigation
   - toc

docs/overrides/examples.html

Lines changed: 21 additions & 21 deletions
@@ -9,7 +9,7 @@ <h2>Examples</h2>
 </div>

 <div class="tx-landing__highlights_grid">
-<a href="finetuning-llama-2">
+<a href="/examples/finetuning-llama-2">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
@@ -27,63 +27,63 @@ <h3>
 </div>
 </a>

-<a href="stable-diffusion-xl">
+<a href="/examples/text-generation-inference">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
-<path d="M17.5 12a1.5 1.5 0 0 1-1.5-1.5A1.5 1.5 0 0 1 17.5 9a1.5 1.5 0 0 1 1.5 1.5 1.5 1.5 0 0 1-1.5 1.5m-3-4A1.5 1.5 0 0 1 13 6.5 1.5 1.5 0 0 1 14.5 5 1.5 1.5 0 0 1 16 6.5 1.5 1.5 0 0 1 14.5 8m-5 0A1.5 1.5 0 0 1 8 6.5 1.5 1.5 0 0 1 9.5 5 1.5 1.5 0 0 1 11 6.5 1.5 1.5 0 0 1 9.5 8m-3 4A1.5 1.5 0 0 1 5 10.5 1.5 1.5 0 0 1 6.5 9 1.5 1.5 0 0 1 8 10.5 1.5 1.5 0 0 1 6.5 12M12 3a9 9 0 0 0-9 9 9 9 0 0 0 9 9 1.5 1.5 0 0 0 1.5-1.5c0-.39-.15-.74-.39-1-.23-.27-.38-.62-.38-1a1.5 1.5 0 0 1 1.5-1.5H16a5 5 0 0 0 5-5c0-4.42-4.03-8-9-8Z"></path>
+<path d="M16 9h3l-5 7m-4-7h4l-2 8M5 9h3l2 7m5-12h2l2 3h-3m-5-3h2l1 3h-4M7 4h2L8 7H5m1-5L2 8l10 14L22 8l-4-6H6Z"></path>
 </svg>
 </div>
 <h3>
-Serving SDXL with FastAPI
+Serving LLMs with TGI
 </h3>

 <p>
-Serving <strong>Stable Diffusion XL</strong> with <strong>FastAPI</strong> to generate
-and refine images via a REST endpoint.
+Serve open-source LLMs as APIs with optimized performance using <strong>TGI</strong>, an
+open-source tool by
+Hugging Face.
 </p>
 </div>
 </a>

-<a href="vllm">
+<a href="/examples/stable-diffusion-xl">
 <div class="feature-cell">
 <div class="feature-icon">
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 27 27">
-<path d="m13.13 22.19-1.63-3.83c1.57-.58 3.04-1.36 4.4-2.27l-2.77 6.1M5.64 12.5l-3.83-1.63 6.1-2.77C7 9.46 6.22 10.93 5.64 12.5M21.61 2.39S16.66.269 11 5.93c-2.19 2.19-3.5 4.6-4.35 6.71-.28.75-.09 1.57.46 2.13l2.13 2.12c.55.56 1.37.74 2.12.46A19.1 19.1 0 0 0 18.07 13c5.66-5.66 3.54-10.61 3.54-10.61m-7.07 7.07c-.78-.78-.78-2.05 0-2.83s2.05-.78 2.83 0c.77.78.78 2.05 0 2.83-.78.78-2.05.78-2.83 0m-5.66 7.07-1.41-1.41 1.41 1.41M6.24 22l3.64-3.64c-.34-.09-.67-.24-.97-.45L4.83 22h1.41M2 22h1.41l4.77-4.76-1.42-1.41L2 20.59V22m0-2.83 4.09-4.08c-.21-.3-.36-.62-.45-.97L2 17.76v1.41Z"></path>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
+<path d="M17.5 12a1.5 1.5 0 0 1-1.5-1.5A1.5 1.5 0 0 1 17.5 9a1.5 1.5 0 0 1 1.5 1.5 1.5 1.5 0 0 1-1.5 1.5m-3-4A1.5 1.5 0 0 1 13 6.5 1.5 1.5 0 0 1 14.5 5 1.5 1.5 0 0 1 16 6.5 1.5 1.5 0 0 1 14.5 8m-5 0A1.5 1.5 0 0 1 8 6.5 1.5 1.5 0 0 1 9.5 5 1.5 1.5 0 0 1 11 6.5 1.5 1.5 0 0 1 9.5 8m-3 4A1.5 1.5 0 0 1 5 10.5 1.5 1.5 0 0 1 6.5 9 1.5 1.5 0 0 1 8 10.5 1.5 1.5 0 0 1 6.5 12M12 3a9 9 0 0 0-9 9 9 9 0 0 0 9 9 1.5 1.5 0 0 0 1.5-1.5c0-.39-.15-.74-.39-1-.23-.27-.38-.62-.38-1a1.5 1.5 0 0 1 1.5-1.5H16a5 5 0 0 0 5-5c0-4.42-4.03-8-9-8Z"></path>
 </svg>
 </div>
 <h3>
-Serving LLMs with vLLM
+Serving SDXL with FastAPI
 </h3>

 <p>
-Serve open-source LLMs as OpenAI-compatible APIs with up to 24 times higher throughput using
-the
-<strong>vLLM</strong> library.
+Serving <strong>Stable Diffusion XL</strong> with <strong>FastAPI</strong> to generate
+and refine images via a REST endpoint.
 </p>
 </div>
 </a>

-<a href="text-generation-inference">
+<a href="/examples/vllm">
 <div class="feature-cell">
 <div class="feature-icon">
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
-<path d="M16 9h3l-5 7m-4-7h4l-2 8M5 9h3l2 7m5-12h2l2 3h-3m-5-3h2l1 3h-4M7 4h2L8 7H5m1-5L2 8l10 14L22 8l-4-6H6Z"></path>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 27 27">
+<path d="m13.13 22.19-1.63-3.83c1.57-.58 3.04-1.36 4.4-2.27l-2.77 6.1M5.64 12.5l-3.83-1.63 6.1-2.77C7 9.46 6.22 10.93 5.64 12.5M21.61 2.39S16.66.269 11 5.93c-2.19 2.19-3.5 4.6-4.35 6.71-.28.75-.09 1.57.46 2.13l2.13 2.12c.55.56 1.37.74 2.12.46A19.1 19.1 0 0 0 18.07 13c5.66-5.66 3.54-10.61 3.54-10.61m-7.07 7.07c-.78-.78-.78-2.05 0-2.83s2.05-.78 2.83 0c.77.78.78 2.05 0 2.83-.78.78-2.05.78-2.83 0m-5.66 7.07-1.41-1.41 1.41 1.41M6.24 22l3.64-3.64c-.34-.09-.67-.24-.97-.45L4.83 22h1.41M2 22h1.41l4.77-4.76-1.42-1.41L2 20.59V22m0-2.83 4.09-4.08c-.21-.3-.36-.62-.45-.97L2 17.76v1.41Z"></path>
 </svg>
 </div>
 <h3>
-Serving LLMs with TGI
+Serving LLMs with vLLM
 </h3>

 <p>
-Serve open-source LLMs as APIs with optimized performance using <strong>TGI</strong>, an
-open-source tool by
-Hugging Face.
+Serve open-source LLMs as OpenAI-compatible APIs with up to 24 times higher throughput using
+the
+<strong>vLLM</strong> library.
 </p>
 </div>
 </a>

-<a href="llmchat">
+<a href="/examples/llmchat">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
