
Commit a0aa057

[Docs] Updated docs and examples to reflect the changes in 0.11.1 (part 2)

1 parent: 9b5227f

File tree: 9 files changed, +122 -82 lines

README.md

Lines changed: 7 additions & 9 deletions
@@ -9,7 +9,7 @@
 </h1>

 <h3 align="center">
-Train and deploy LLM models in multiple clouds
+Run LLM workloads across any clouds
 </h3>

 <p align="center">
@@ -23,18 +23,16 @@ Train and deploy LLM models in multiple clouds
 [![PyPI - License](https://img.shields.io/pypi/l/dstack?style=flat-square&color=blue)](https://github.com/dstackai/dstack/blob/master/LICENSE.md)
 </div>

-`dstack` is an open-source tool that enables the execution of LLM workloads
-across multiple cloud providers – ensuring the best GPU price and availability.
+`dstack` is an open-source toolkit for running LLM workloads across any clouds, offering a
+cost-efficient and user-friendly interface for training, inference, and development.

-Deploy services, run tasks, and provision dev environments
-in a cost-effective manner across multiple cloud GPU providers.
-
-## Latest news
+## Latest news ✨

 - [2023/08] [Fine-tuning with Llama 2](https://dstack.ai/examples/finetuning-llama-2) (Example)
 - [2023/08] [An early preview of services](https://dstack.ai/blog/2023/08/07/services-preview) (Release)
-- [2023/07] [Port mapping, max duration, and more](https://dstack.ai/blog/2023/07/25/port-mapping-max-duration-and-more) (Release)
-- [2023/07] [Serving with vLLM](https://dstack.ai/examples/vllm) (Example)
+- [2023/08] [Serving SDXL with FastAPI](https://dstack.ai/examples/stable-diffusion-xl) (Example)
+- [2023/07] [Serving LLMS with TGI](https://dstack.ai/examples/text-generation-inference) (Example)
+- [2023/07] [Serving LLMS with vLLM](https://dstack.ai/examples/vllm) (Example)

 ## Installation

docs/blog/posts/multiple-clouds.md

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ categories:
 - Releases
 ---

-# Discover GPU across multiple clouds
+# Automatic GPU discovery across clouds

 __The 0.11 update significantly cuts GPU costs and boosts their availability.__

@@ -16,7 +16,7 @@ configured cloud providers and regions.

 <!-- more -->

-## Multiple clouds per project
+## Multiple backends per project

 Now, `dstack` leverages price data from multiple configured cloud providers and regions to automatically suggest the
 most cost-effective options.

docs/examples/text-generation-inference.md

Lines changed: 45 additions & 8 deletions
@@ -31,13 +31,11 @@ Here's the configuration that uses services:

 ```yaml
 type: service
-# This configuration deploys a given LLM model as an API

 image: ghcr.io/huggingface/text-generation-inference:latest

 env:
-  # (Required) Specify the name of the model
-  - MODEL_ID=tiiuae/falcon-7b
+  - MODEL_ID=NousResearch/Llama-2-7b-hf

 port: 8000

@@ -84,11 +82,50 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com \

 </div>

-!!! info "Gated models"
-    To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
-    (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
-    or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
-    using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+### Gated models
+
+To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
+(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
+or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
+using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+
+<div class="termy">
+
+```shell
+$ dstack run . -f text-generation-inference/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=&lt;token&gt; --gpu 24GB
+```
+
+</div>
+
+### Memory usage and quantization
+
+An LLM typically requires twice the GPU memory compared to its parameter count. For instance, a model with `13B` parameters
+needs around `26GB` of GPU memory. To decrease memory usage and fit the model on a smaller GPU, consider using
+quantization, which TGI offers as `bitsandbytes` and `gptq` methods.
+
+Here's an example of the Llama 2 13B model tailored for a `24GB` GPU (A10 or L4):
+
+<div editor-title="text-generation-inference/serve.dstack.yml">
+
+```yaml
+type: service
+
+image: ghcr.io/huggingface/text-generation-inference:latest
+
+env:
+  - MODEL_ID=TheBloke/Llama-2-13B-GPTQ
+
+port: 8000
+
+commands:
+  - text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq
+```
+
+</div>
+
+A similar approach allows running the Llama 2 70B model on an `80GB` GPU (A100).
+
+To calculate the exact GPU memory required for a specific model with different quantization methods, you can use the
+[hf-accelerate/memory-model-usage](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) Space.

 ??? info "Dev environments"

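As a back-of-the-envelope check on the memory figures added above, here is a minimal Python sketch. The bytes-per-parameter values are rule-of-thumb assumptions (fp16 weights, 8-bit and 4-bit quantization) and ignore activations, the KV cache, and framework overhead, so treat the results as rough lower bounds rather than exact requirements.

```python
# Rough GPU memory needed just to hold model weights, per quantization scheme.
# These are rule-of-thumb figures, not measurements.
BYTES_PER_PARAM = {
    "fp16": 2.0,       # half-precision weights (the "2x parameter count" rule)
    "int8": 1.0,       # e.g. bitsandbytes 8-bit
    "gptq-4bit": 0.5,  # e.g. GPTQ 4-bit
}


def weight_memory_gb(params_in_billions: float, scheme: str) -> float:
    """Approximate memory (GB) required to load the weights alone.

    Billions of parameters times bytes per parameter gives GB directly,
    since the factor of 1e9 cancels.
    """
    return params_in_billions * BYTES_PER_PARAM[scheme]


for name, size_b in [("Llama-2-7B", 7), ("Llama-2-13B", 13), ("Llama-2-70B", 70)]:
    fp16 = weight_memory_gb(size_b, "fp16")
    gptq = weight_memory_gb(size_b, "gptq-4bit")
    print(f"{name}: ~{fp16:.1f} GB at fp16, ~{gptq:.1f} GB with 4-bit GPTQ")

# Approximate output:
#   Llama-2-7B: ~14.0 GB at fp16, ~3.5 GB with 4-bit GPTQ
#   Llama-2-13B: ~26.0 GB at fp16, ~6.5 GB with 4-bit GPTQ
#   Llama-2-70B: ~140.0 GB at fp16, ~35.0 GB with 4-bit GPTQ
```

The ~6.5 GB estimate for a 4-bit 13B model is consistent with the example above, which targets a `24GB` A10 or L4; the remaining headroom goes to activations and the KV cache.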
docs/examples/vllm.md

Lines changed: 15 additions & 9 deletions
@@ -31,12 +31,10 @@ Here's the configuration that uses services to run an LLM as an OpenAI-compatibl
 ```yaml
 type: service

-# (Optional) If not specified, it will use your local version
 python: "3.11"

 env:
-  # (Required) Specify the name of the model
-  - MODEL=facebook/opt-125m
+  - MODEL=NousResearch/Llama-2-7b-hf

 port: 8000

@@ -75,7 +73,7 @@ Once the service is up, you can query the endpoint:
 $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
-        "model": "facebook/opt-125m",
+        "model": "NousResearch/Llama-2-7b-hf",
         "prompt": "San Francisco is a",
         "max_tokens": 7,
         "temperature": 0
@@ -84,10 +82,18 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \

 </div>

-!!! info "Gated models"
-    To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret
-    (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
-    or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
-    using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+### Gated models
+
+To use a gated-access model from Hugging Face Hub, make sure to set up either the `HUGGING_FACE_HUB_TOKEN` secret
+(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)),
+or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or
+using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file).
+
+<div class="termy">
+
+```shell
+$ dstack run . -f vllm/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=&lt;token&gt; --gpu 24GB
+```
+
+</div>

[Source code](https://github.com/dstackai/dstack-examples){ .md-button .md-button--github }
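Because the endpoint above is OpenAI-compatible, the same completion request can also be sent from Python. A minimal sketch, assuming the `openai` package (pre-1.0 interface) and the example service hostname used in the docs above:

```python
import openai

# Point the client at the dstack service endpoint instead of api.openai.com.
# Replace the hostname with the URL printed by `dstack run`.
openai.api_base = "https://yellow-cat-1.mydomain.com/v1"
openai.api_key = "EMPTY"  # the vLLM server does not validate the key

completion = openai.Completion.create(
    model="NousResearch/Llama-2-7b-hf",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```

The request mirrors the `curl` call shown above; only the client differs.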

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 template: home.html
-title: Train and deploy LLM models in multiple clouds
+title: Run LLM workloads across any clouds
 hide:
   - navigation
   - toc

docs/overrides/examples.html

Lines changed: 21 additions & 21 deletions
@@ -9,7 +9,7 @@ <h2>Examples</h2>
 </div>

 <div class="tx-landing__highlights_grid">
-<a href="finetuning-llama-2">
+<a href="/examples/finetuning-llama-2">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
@@ -27,63 +27,63 @@ <h3>
 </div>
 </a>

-<a href="stable-diffusion-xl">
+<a href="/examples/text-generation-inference">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
-<path d="M17.5 12a1.5 1.5 0 0 1-1.5-1.5A1.5 1.5 0 0 1 17.5 9a1.5 1.5 0 0 1 1.5 1.5 1.5 1.5 0 0 1-1.5 1.5m-3-4A1.5 1.5 0 0 1 13 6.5 1.5 1.5 0 0 1 14.5 5 1.5 1.5 0 0 1 16 6.5 1.5 1.5 0 0 1 14.5 8m-5 0A1.5 1.5 0 0 1 8 6.5 1.5 1.5 0 0 1 9.5 5 1.5 1.5 0 0 1 11 6.5 1.5 1.5 0 0 1 9.5 8m-3 4A1.5 1.5 0 0 1 5 10.5 1.5 1.5 0 0 1 6.5 9 1.5 1.5 0 0 1 8 10.5 1.5 1.5 0 0 1 6.5 12M12 3a9 9 0 0 0-9 9 9 9 0 0 0 9 9 1.5 1.5 0 0 0 1.5-1.5c0-.39-.15-.74-.39-1-.23-.27-.38-.62-.38-1a1.5 1.5 0 0 1 1.5-1.5H16a5 5 0 0 0 5-5c0-4.42-4.03-8-9-8Z"></path>
+<path d="M16 9h3l-5 7m-4-7h4l-2 8M5 9h3l2 7m5-12h2l2 3h-3m-5-3h2l1 3h-4M7 4h2L8 7H5m1-5L2 8l10 14L22 8l-4-6H6Z"></path>
 </svg>
 </div>
 <h3>
-Serving SDXL with FastAPI
+Serving LLMs with TGI
 </h3>

 <p>
-Serving <strong>Stable Diffusion XL</strong> with <strong>FastAPI</strong> to generate
-and refine images via a REST endpoint.
+Serve open-source LLMs as APIs with optimized performance using <strong>TGI</strong>, an
+open-source tool by
+Hugging Face.
 </p>
 </div>
 </a>

-<a href="vllm">
+<a href="/examples/stable-diffusion-xl">
 <div class="feature-cell">
 <div class="feature-icon">
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 27 27">
-<path d="m13.13 22.19-1.63-3.83c1.57-.58 3.04-1.36 4.4-2.27l-2.77 6.1M5.64 12.5l-3.83-1.63 6.1-2.77C7 9.46 6.22 10.93 5.64 12.5M21.61 2.39S16.66.269 11 5.93c-2.19 2.19-3.5 4.6-4.35 6.71-.28.75-.09 1.57.46 2.13l2.13 2.12c.55.56 1.37.74 2.12.46A19.1 19.1 0 0 0 18.07 13c5.66-5.66 3.54-10.61 3.54-10.61m-7.07 7.07c-.78-.78-.78-2.05 0-2.83s2.05-.78 2.83 0c.77.78.78 2.05 0 2.83-.78.78-2.05.78-2.83 0m-5.66 7.07-1.41-1.41 1.41 1.41M6.24 22l3.64-3.64c-.34-.09-.67-.24-.97-.45L4.83 22h1.41M2 22h1.41l4.77-4.76-1.42-1.41L2 20.59V22m0-2.83 4.09-4.08c-.21-.3-.36-.62-.45-.97L2 17.76v1.41Z"></path>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
+<path d="M17.5 12a1.5 1.5 0 0 1-1.5-1.5A1.5 1.5 0 0 1 17.5 9a1.5 1.5 0 0 1 1.5 1.5 1.5 1.5 0 0 1-1.5 1.5m-3-4A1.5 1.5 0 0 1 13 6.5 1.5 1.5 0 0 1 14.5 5 1.5 1.5 0 0 1 16 6.5 1.5 1.5 0 0 1 14.5 8m-5 0A1.5 1.5 0 0 1 8 6.5 1.5 1.5 0 0 1 9.5 5 1.5 1.5 0 0 1 11 6.5 1.5 1.5 0 0 1 9.5 8m-3 4A1.5 1.5 0 0 1 5 10.5 1.5 1.5 0 0 1 6.5 9 1.5 1.5 0 0 1 8 10.5 1.5 1.5 0 0 1 6.5 12M12 3a9 9 0 0 0-9 9 9 9 0 0 0 9 9 1.5 1.5 0 0 0 1.5-1.5c0-.39-.15-.74-.39-1-.23-.27-.38-.62-.38-1a1.5 1.5 0 0 1 1.5-1.5H16a5 5 0 0 0 5-5c0-4.42-4.03-8-9-8Z"></path>
 </svg>
 </div>
 <h3>
-Serving LLMs with vLLM
+Serving SDXL with FastAPI
 </h3>

 <p>
-Serve open-source LLMs as OpenAI-compatible APIs with up to 24 times higher throughput using
-the
-<strong>vLLM</strong> library.
+Serving <strong>Stable Diffusion XL</strong> with <strong>FastAPI</strong> to generate
+and refine images via a REST endpoint.
 </p>
 </div>
 </a>

-<a href="text-generation-inference">
+<a href="/examples/vllm">
 <div class="feature-cell">
 <div class="feature-icon">
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
-<path d="M16 9h3l-5 7m-4-7h4l-2 8M5 9h3l2 7m5-12h2l2 3h-3m-5-3h2l1 3h-4M7 4h2L8 7H5m1-5L2 8l10 14L22 8l-4-6H6Z"></path>
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 27 27">
+<path d="m13.13 22.19-1.63-3.83c1.57-.58 3.04-1.36 4.4-2.27l-2.77 6.1M5.64 12.5l-3.83-1.63 6.1-2.77C7 9.46 6.22 10.93 5.64 12.5M21.61 2.39S16.66.269 11 5.93c-2.19 2.19-3.5 4.6-4.35 6.71-.28.75-.09 1.57.46 2.13l2.13 2.12c.55.56 1.37.74 2.12.46A19.1 19.1 0 0 0 18.07 13c5.66-5.66 3.54-10.61 3.54-10.61m-7.07 7.07c-.78-.78-.78-2.05 0-2.83s2.05-.78 2.83 0c.77.78.78 2.05 0 2.83-.78.78-2.05.78-2.83 0m-5.66 7.07-1.41-1.41 1.41 1.41M6.24 22l3.64-3.64c-.34-.09-.67-.24-.97-.45L4.83 22h1.41M2 22h1.41l4.77-4.76-1.42-1.41L2 20.59V22m0-2.83 4.09-4.08c-.21-.3-.36-.62-.45-.97L2 17.76v1.41Z"></path>
 </svg>
 </div>
 <h3>
-Serving LLMs with TGI
+Serving LLMs with vLLM
 </h3>

 <p>
-Serve open-source LLMs as APIs with optimized performance using <strong>TGI</strong>, an
-open-source tool by
-Hugging Face.
+Serve open-source LLMs as OpenAI-compatible APIs with up to 24 times higher throughput using
+the
+<strong>vLLM</strong> library.
 </p>
 </div>
 </a>

-<a href="llmchat">
+<a href="/examples/llmchat">
 <div class="feature-cell">
 <div class="feature-icon">
 <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
