CONTRIBUTING.md
+6 -1 (6 additions & 1 deletion)
@@ -283,18 +283,23 @@ By convention, we define API payloads in `sky/server/api/payloads.py` and there
283
283
- When receiving a payload from an older version without the new field, the default value is used for the missing new field.
284
284
- When receiving a payload from a newer version with a new field, the value of the new field is ignored.
285
285
286
-
However, when the value of the new field is taken from an user input (e.g. CLI flag), we should add a warning message to inform the user that the new field is ignored. An API version bump is required in this case. For example:
286
+
However, when the value of the new field is taken from user input (e.g. CLI flag), we should warn (or throw an error) to inform the user that the new field is not supported on the current api server version. Calling `versions.get_remote_api_version()` in `sky/client/cli/command.py` will return `None` until we check the api server status, which can be triggered by adding a `@server_common.check_server_healthy_or_start` decorator around the cli entry point. For example:
287
+
287
288
288
289
```python
289
290
from sky.server import versions
290
291
291
292
@click.option('--newflag', default=None)
293
+
# Must have this or the version will be None!
294
+
@server_common.check_server_healthy_or_start
292
295
def cli_entry_point(newflag: Optional[str] = None):
293
296
    # The new flag is set but the server does not support the new field yet
294
297
    if newflag is not None and versions.get_remote_api_version() < 12:
295
298
        logger.warning('The new flag is ignored because the server does not support it yet.')
296
299
```
297
300
301
+
We can also just check for the unsupported field in the sdk and surface the error in the cli.
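The SDK-side check mentioned above could look roughly like the following. This is a minimal, self-contained sketch and not SkyPilot's implementation: `UnsupportedFieldError`, `sdk_check_field`, and the two-argument `cli_entry_point` are hypothetical names, the remote API version is passed in as a plain argument instead of being read from `versions.get_remote_api_version()`, and the threshold `12` is simply the number used in the example above. It also handles the `None` case explicitly, since comparing `None < 12` raises a `TypeError` in Python 3.

```python
from typing import Optional


class UnsupportedFieldError(Exception):
    """Hypothetical error: the remote API server cannot handle a field."""


def sdk_check_field(newflag: Optional[str],
                    remote_api_version: Optional[int],
                    min_version: int = 12) -> None:
    """Raise if newflag is set but the server is too old or unknown.

    min_version=12 mirrors the example above; it is an assumption here.
    """
    if newflag is None:
        # The flag was not provided, so there is nothing to validate.
        return
    if remote_api_version is None:
        # The version stays None until the server status has been checked.
        raise UnsupportedFieldError(
            'the remote API version is unknown; '
            'check the API server status first')
    if remote_api_version < min_version:
        raise UnsupportedFieldError(
            f'newflag requires API version >= {min_version}, '
            f'but the server reports {remote_api_version}')


def cli_entry_point(newflag: Optional[str],
                    remote_api_version: Optional[int]) -> str:
    """Hypothetical CLI wrapper that surfaces the SDK error to the user."""
    try:
        sdk_check_field(newflag, remote_api_version)
    except UnsupportedFieldError as e:
        return f'Error: {e}'
    return 'ok'
```

Keeping the check in one SDK function means every CLI command that sends the field reports the same message, and the CLI layer only decides how to display it.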
302
+
298
303
We should also be careful when adding new fields that are not directly visible in
299
304
`sky/server/api/payloads.py`, but is also being sent from the client to the server. This
README.md
+2 -2 (2 additions & 2 deletions)
@@ -39,6 +39,7 @@
39
39
----
40
40
41
41
:fire: *News* :fire:
42
+
- [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
42
43
- [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
43
44
- [Oct 2025] Run large-scale **LLM training with TorchTitan** on any AI infra: [**example**](./examples/training/torchtitan)
44
45
- [Sep 2025] Scaling AI infrastructure at Abridge - **10x faster development** with SkyPilot: [**blog**](https://blog.skypilot.co/abridge/)
This command shows 50 managed jobs by default. Use ``--limit <num>`` to show more jobs, or ``--all`` to show all jobs.
167
+
166
168
Stream the logs of a running managed job:
167
169
168
170
.. code-block:: console
@@ -615,6 +617,7 @@ Submit multiple jobs at once
615
617
616
618
Pools support a :code:`--num-jobs` flag to conveniently submit multiple jobs at once.
617
619
Each job will be assigned a unique environment variable :code:`$SKYPILOT_JOB_RANK`, which can be used to determine the job partition.
620
+
Additionally, the :code:`$SKYPILOT_NUM_JOBS` environment variable will be set to the total number of jobs submitted.
618
621
619
622
For example, if you have 1000 prompts to evaluate, each job can process prompts with sequence numbers
620
623
:code:`$SKYPILOT_JOB_RANK * 100` to :code:`($SKYPILOT_JOB_RANK + 1) * 100`.
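The partitioning described above can also be computed inside the job from the two environment variables. The following is a minimal sketch under stated assumptions: `prompt_range` is a hypothetical helper, ranks are assumed to start at 0 (as the `$SKYPILOT_JOB_RANK * 100` arithmetic suggests), and the environment is passed in as a mapping so the function can be exercised outside a real job.

```python
from typing import Mapping, Tuple


def prompt_range(env: Mapping[str, str], total_prompts: int) -> Tuple[int, int]:
    """Return the half-open [start, end) slice of prompts for this job.

    Assumes SKYPILOT_JOB_RANK is zero-based and SKYPILOT_NUM_JOBS is the
    total number of jobs submitted.
    """
    rank = int(env['SKYPILOT_JOB_RANK'])
    num_jobs = int(env['SKYPILOT_NUM_JOBS'])
    # Ceiling division so every prompt is covered even when the total
    # does not divide evenly across jobs.
    per_job = -(-total_prompts // num_jobs)
    start = min(rank * per_job, total_prompts)
    end = min(start + per_job, total_prompts)
    return start, end
```

Inside a job, calling `prompt_range(os.environ, 1000)` would give the slice for that job. With 10 jobs and 1000 prompts this matches the fixed 100-prompt ranges described above, without hard-coding the chunk size.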
@@ -630,7 +633,7 @@ Here is a simple example:
630
633
accelerators: {H100:1, H200:1}
631
634
632
635
run: |
633
-
echo "Job rank: $SKYPILOT_JOB_RANK"
636
+
echo "Job rank: $SKYPILOT_JOB_RANK out of $SKYPILOT_NUM_JOBS"
634
637
echo "Processing prompts from $(($SKYPILOT_JOB_RANK * 100)) to $((($SKYPILOT_JOB_RANK + 1) * 100))"
635
638
# Actual business logic here...
636
639
echo "Job $SKYPILOT_JOB_RANK finished"
@@ -766,7 +769,7 @@ For managed jobs, SkyPilot uses an intermediate bucket to store files used in th
766
769
767
770
If you do not configure a bucket, SkyPilot will automatically create a temporary bucket named :code:`skypilot-filemounts-{username}-{run_id}` for each job launch. SkyPilot automatically deletes the bucket after the job completes.
768
771
769
-
**Object store access is not necessary to use managed jobs.** If cloud object storage is not available (e.g., Kubernetes deployments), SkyPilot automatically falls back to a two-hop upload that copies files to the jobs controller and then downloads them to the jobs.
772
+
**Object store access is not necessary to use managed jobs.** If cloud object storage is not available (e.g., Kubernetes deployments), SkyPilot automatically falls back to a two-hop upload that copies files to the jobs controller and then downloads them to the jobs.
770
773
771
774
.. tip::
772
775
@@ -1015,4 +1018,4 @@ For absolute maximum parallelism, the following per-cloud configurations are rec
1015
1018
.. note::
1016
1019
Remember to tear down your controller to apply these changes, as described above.
1017
1020
1018
-
With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.
1021
+
With this configuration, you can launch up to 512 jobs at once. Once the jobs are launched, up to 2000 jobs can be running in parallel.
docs/source/reference/api-server/examples/api-server-metrics-setup.rst
+15 (15 additions & 0 deletions)
@@ -35,6 +35,21 @@ chart deploy everything for you with a single command:
35
35
--set prometheus.enabled=true \
36
36
--set grafana.enabled=true
37
37
38
+
.. dropdown:: Turn off GPU metrics scraping
39
+
40
+
The above command also configures Prometheus to scrape the SkyPilot API server's ``/gpu-metrics`` endpoint. To disable scraping of ``/gpu-metrics``, append ``--set prometheus.extraScrapeConfigs=""`` to the Helm command: