Commit 8552d30

merge master
2 parents 870da18 + aac9d13 commit 8552d30

137 files changed, +8309 -1671 lines changed

CONTRIBUTING.md

Lines changed: 6 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -283,18 +283,23 @@ By convention, we define API payloads in `sky/server/api/payloads.py` and there
283283
- When receiving a payload from an older version without the new field, the default value is used for the missing new field.
284284
- When receiving a payload from a newer version with a new field, the value of the new field is ignored.
285285

286-
However, when the value of the new field is taken from an user input (e.g. CLI flag), we should add a warning message to inform the user that the new field is ignored. An API version bump is required in this case. For example:
286+
However, when the value of the new field is taken from user input (e.g. CLI flag), we should warn (or throw an error) to inform the user that the new field is not supported on the current api server version. Calling `versions.get_remote_api_version()` in `sky/client/cli/command.py` will return `None` until we check the api server status which can be triggered by adding a `@server_common.check_server_healthy_or_start` decorator around the cli entry point. For example:
287+
287288

288289
```python
289290
from sky.server import versions
290291

291292
@click.option('--newflag', default=None)
293+
# Must have this or the version will be None!
294+
@server_common.check_server_healthy_or_start
292295
def cli_entry_point(newflag: Optional[str] = None):
293296
# The new flag is set but the server does not support the new field yet
294297
if newflag is not None and versions.get_remote_api_version() < 12:
295298
logger.warning('The new flag is ignored because the server does not support it yet.')
296299
```
297300

301+
We can also just check for the unsupported field in the sdk and surface the error in the cli.
302+
298303
We should also be careful when adding new fields that are not directly visible in
299304
`sky/server/api/payloads.py`, but is also being sent from the client to the server. This
300305
is mainly for validating objects from the client.
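
The compatibility rules in this hunk can be sketched in a few lines of Python. This is a minimal illustration, not SkyPilot's actual implementation: `LaunchBody`, `parse_payload`, `check_new_flag`, and `MIN_VERSION_FOR_NEW_FIELD` are hypothetical names, and the remote version is passed in as a plain argument instead of calling `versions.get_remote_api_version()`. The sketch treats a `None` version (server status not checked yet) as "unsupported" so that the comparison never runs against `None`.

```python
import logging
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Hypothetical: first API version whose server understands `new_field`.
MIN_VERSION_FOR_NEW_FIELD = 12


@dataclass
class LaunchBody:
    """Hypothetical payload; not the real class in payloads.py."""
    task: str
    # A field added in a later version needs a default so that payloads
    # from older peers, which omit it, still parse.
    new_field: Optional[str] = None


def parse_payload(raw: Dict[str, Any]) -> LaunchBody:
    """Ignore unknown keys; dataclass defaults fill in missing ones."""
    known = {f.name for f in fields(LaunchBody)}
    return LaunchBody(**{k: v for k, v in raw.items() if k in known})


def new_flag_supported(remote_version: Optional[int]) -> bool:
    """An unknown (None) remote version counts as unsupported."""
    return (remote_version is not None and
            remote_version >= MIN_VERSION_FOR_NEW_FIELD)


def check_new_flag(newflag: Optional[str],
                   remote_version: Optional[int]) -> Optional[str]:
    """Return the value to send, warning if the server is too old."""
    if newflag is not None and not new_flag_supported(remote_version):
        logger.warning('--newflag is ignored: the API server does not '
                       'support it yet.')
        return None
    return newflag
```

Guarding on `None` first means a missing health check degrades to a warning rather than a `TypeError` from comparing `None` with an integer.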

README.md

Lines changed: 2 additions & 2 deletions
@@ -39,6 +39,7 @@
3939
----
4040

4141
:fire: *News* :fire:
42+
- [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
4243
- [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
4344
- [Oct 2025] Run large-scale **LLM training with TorchTitan** on any AI infra: [**example**](./examples/training/torchtitan)
4445
- [Sep 2025] Scaling AI infrastructure at Abridge - **10x faster development** with SkyPilot: [**blog**](https://blog.skypilot.co/abridge/)
@@ -106,8 +107,7 @@ pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,nebius,lambda,runpod,
106107
</p>
107108

108109
Current supported infra: Kubernetes, AWS, GCP, Azure, OCI, Nebius, Lambda Cloud, RunPod, Fluidstack,
109-
Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai,
110-
VMware vSphere, Seeweb.
110+
Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect.
111111
<p align="center">
112112
<picture>
113113
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png">

charts/skypilot/templates/grafana-ingress.yaml

Lines changed: 4 additions & 1 deletion
@@ -25,7 +25,10 @@ metadata:
2525
spec:
2626
ingressClassName: {{ .Values.grafana.ingress.ingressClassName }}
2727
rules:
28-
- http:
28+
- {{- if .Values.ingress.host }}
29+
host: {{ .Values.ingress.host | quote }}
30+
{{- end }}
31+
http:
2932
paths:
3033
- backend:
3134
service:

charts/skypilot/values.schema.json

Lines changed: 9 additions & 0 deletions
@@ -384,8 +384,17 @@
384384
"config": {
385385
"type": "object",
386386
"properties": {
387+
"gzip-level": {
388+
"type": "integer"
389+
},
390+
"gzip-min-length": {
391+
"type": "integer"
392+
},
387393
"http-snippet": {
388394
"type": "string"
395+
},
396+
"use-gzip": {
397+
"type": "string"
389398
}
390399
}
391400
},

charts/skypilot/values.yaml

Lines changed: 7 additions & 1 deletion
@@ -350,6 +350,9 @@ ingress-nginx:
350350
service.beta.kubernetes.io/port_443_health-probe_protocol: "TCP"
351351
service.beta.kubernetes.io/port_80_health-probe_protocol: "TCP"
352352
config:
353+
use-gzip: "true"
354+
gzip-level: 5
355+
gzip-min-length: 1000
353356
http-snippet: |
354357
map $http_upgrade $connection_upgrade {
355358
default upgrade;
@@ -570,7 +573,10 @@ grafana:
570573
enableAuthedIngress: true
571574
path: "/grafana"
572575
ingressClassName: nginx
573-
hosts: null
576+
# @schema type: [array]; item: string
577+
# If you set hosts to null, the Grafana Helm chart will use its default value ([chart-example.local]).
578+
# To match all hosts, set hosts to an empty array ([]).
579+
hosts: []
574580
grafana.ini:
575581
server:
576582
domain: localhost

docs/source/docs/index.rst

Lines changed: 1 addition & 2 deletions
@@ -206,8 +206,7 @@ SkyPilot **cuts your cloud costs & maximizes GPU availability**:
206206
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
207207

208208
Current supported infra: Kubernetes, AWS, GCP, Azure, OCI, Nebius, Lambda Cloud, RunPod, Fluidstack,
209-
Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai,
210-
VMware vSphere.
209+
Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Seeweb, Prime Intellect.
211210

212211
.. raw:: html
213212

docs/source/examples/managed-jobs.rst

Lines changed: 7 additions & 4 deletions
@@ -149,7 +149,7 @@ Work with managed jobs
149149

150150
For a list of all commands and options, run :code:`sky jobs --help` or read the :ref:`CLI reference <cli>`.
151151

152-
See a list of all managed jobs:
152+
See a list of managed jobs:
153153

154154
.. code-block:: console
155155
@@ -163,6 +163,8 @@ See a list of all managed jobs:
163163
2 roberta 1x [A100:8][Spot] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING
164164
1 bert-qa 1x [V100:1][Spot] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING
165165
166+
This command shows 50 managed jobs by default, use ``--limit <num>`` to show more jobs or use ``--all`` to show all jobs.
167+
166168
Stream the logs of a running managed job:
167169

168170
.. code-block:: console
@@ -615,6 +617,7 @@ Submit multiple jobs at once
615617

616618
Pools support a :code:`--num-jobs` flag to conveniently submit multiple jobs at once.
617619
Each job will be assigned a unique environment variable :code:`$SKYPILOT_JOB_RANK`, which can be used to determine the job partition.
620+
Additionally, the :code:`$SKYPILOT_NUM_JOBS` environment variable will be set to the total number of jobs submitted.
618621

619622
For example, if you have 1000 prompts to evaluate, each job can process prompts with sequence numbers
620623
:code:`$SKYPILOT_JOB_RANK * 100` to :code:`($SKYPILOT_JOB_RANK + 1) * 100`.
@@ -630,7 +633,7 @@ Here is a simple example:
630633
accelerators: {H100:1, H200:1}
631634
632635
run: |
633-
echo "Job rank: $SKYPILOT_JOB_RANK"
636+
echo "Job rank: $SKYPILOT_JOB_RANK out of $SKYPILOT_NUM_JOBS"
634637
echo "Processing prompts from $(($SKYPILOT_JOB_RANK * 100)) to $((($SKYPILOT_JOB_RANK + 1) * 100))"
635638
# Actual business logic here...
636639
echo "Job $SKYPILOT_JOB_RANK finished"
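
With `$SKYPILOT_NUM_JOBS` available, a job can compute its own share of the work instead of hard-coding a chunk size of 100. The following is a minimal sketch, assuming ranks are zero-based and contiguous from `0` to `SKYPILOT_NUM_JOBS - 1`; `prompt_range` and `range_from_env` are hypothetical helpers, not part of SkyPilot.

```python
import os
from typing import Tuple


def prompt_range(rank: int, num_jobs: int,
                 total_prompts: int) -> Tuple[int, int]:
    """Half-open [start, end) slice for one job.

    Slices are contiguous and their sizes differ by at most one, so the
    split also works when total_prompts is not a multiple of num_jobs.
    """
    if num_jobs <= 0 or not 0 <= rank < num_jobs:
        raise ValueError(f'invalid rank {rank} for {num_jobs} jobs')
    base, extra = divmod(total_prompts, num_jobs)
    # The first `extra` ranks each take one additional prompt.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def range_from_env(total_prompts: int) -> Tuple[int, int]:
    """Read rank and job count from the environment variables."""
    rank = int(os.environ['SKYPILOT_JOB_RANK'])
    num_jobs = int(os.environ['SKYPILOT_NUM_JOBS'])
    return prompt_range(rank, num_jobs, total_prompts)
```

For 1000 prompts and 10 jobs this reproduces the ranges in the example above: rank 0 gets `[0, 100)` and rank 9 gets `[900, 1000)`.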
@@ -766,7 +769,7 @@ For managed jobs, SkyPilot uses an intermediate bucket to store files used in th
766769

767770
If you do not configure a bucket, SkyPilot will automatically create a temporary bucket named :code:`skypilot-filemounts-{username}-{run_id}` for each job launch. SkyPilot automatically deletes the bucket after the job completes.
768771

769-
**Object store access is not necessary to use managed jobs.** If cloud object storage is not available (e.g., Kubernetes deployments), SkyPilot automatically falls back to a two-hop upload that copies files to the jobs controller and then downloads them to the jobs.
772+
**Object store access is not necessary to use managed jobs.** If cloud object storage is not available (e.g., Kubernetes deployments), SkyPilot automatically falls back to a two-hop upload that copies files to the jobs controller and then downloads them to the jobs.
770773

771774
.. tip::
772775

@@ -1015,4 +1018,4 @@ For absolute maximum parallelism, the following per-cloud configurations are rec
10151018
.. note::
10161019
Remember to tear down your controller to apply these changes, as described above.
10171020

1018-
With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.
1021+
With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.

docs/source/examples/training/index.rst

Lines changed: 1 addition & 0 deletions
@@ -20,4 +20,5 @@ Training
2020
Training on TPUs <tpu.md>
2121
Unsloth <unsloth.md>
2222
Verl (RLHF) <verl.md>
23+
SkyRL <skyrl.md>
2324
Vertex AI <https://medium.com/google-cloud/streamline-ai-ml-model-development-on-gke-with-skypilot-and-vertex-ai-workbench-453729a8897c>
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../../generated-examples/skyrl.md

docs/source/reference/api-server/examples/api-server-metrics-setup.rst

Lines changed: 15 additions & 0 deletions
@@ -35,6 +35,21 @@ chart deploy everything for you with a single command:
3535
--set prometheus.enabled=true \
3636
--set grafana.enabled=true
3737
38+
.. dropdown:: Turn off GPU metrics scraping
39+
40+
The above command also configures Prometheus to scrape the SkyPilot API server's ``/gpu-metrics`` endpoint. To disable scraping of ``/gpu-metrics``, append ``--set prometheus.extraScrapeConfigs=""`` to the Helm command:
41+
42+
.. code-block:: bash
43+
44+
helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
45+
--namespace skypilot \
46+
--create-namespace \
47+
--reuse-values \
48+
--set apiService.metrics.enabled=true \
49+
--set prometheus.enabled=true \
50+
--set prometheus.extraScrapeConfigs="" \
51+
--set grafana.enabled=true
52+
3853
You can access Grafana at the ``/grafana`` endpoint:
3954

4055
.. code-block:: bash
