Releases: skypilot-org/skypilot

SkyPilot v0.10.3.post2

10 Oct 21:36
a51ed28

This patch release is a minor bump over v0.10.3.post1 that improves robustness on CoreWeave clusters by adding retries and a fallback for Kubernetes API calls:

  • Fall back to file-based command execution on HTTP 431 and 400 errors #7536 #7563
  • Retry on transient authentication issues with error code 403 #7568 #7574
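
The two fixes above follow a common pattern: retry a call when the failure looks transient, and switch to a different execution path when the request itself is rejected. The following is a minimal Python sketch of that pattern, not SkyPilot's actual implementation. `ApiError`, `run_with_fallback`, and the two callables are hypothetical names, and the status-code sets simply mirror the codes listed above.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical error carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"API call failed with status {status}")
        self.status = status


RETRYABLE = {403}      # treated here as transient authentication failures
FALLBACK = {400, 431}  # request rejected, so try a different execution path


def run_with_fallback(primary: Callable[[], str],
                      fallback: Callable[[], str],
                      max_retries: int = 3,
                      backoff_seconds: float = 0.0) -> str:
    """Call primary(); retry on RETRYABLE codes, use fallback() on FALLBACK codes."""
    for attempt in range(max_retries + 1):
        try:
            return primary()
        except ApiError as e:
            if e.status in FALLBACK:
                return fallback()
            if e.status in RETRYABLE and attempt < max_retries:
                # Exponential backoff between retries.
                time.sleep(backoff_seconds * (2 ** attempt))
                continue
            raise
    raise RuntimeError("unreachable")
```

Any other error, or a 403 that persists after the last retry, is re-raised to the caller unchanged.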

To upgrade:

pip install "skypilot==0.10.3.post2"
# Restart your local API server
sky api stop; sky api start

Or, upgrade the API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.3

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:0.10.3.post2 \
  --version $VERSION --devel --reuse-values

Full Changelog: v0.10.3.post1...v0.10.3.post2

SkyPilot v0.10.3.post1

22 Sep 20:43
48dfbea

This patch release is a minor bump over v0.10.3 to fix a dependency issue caused by a breaking change in uvicorn==0.36.0:

  • Pin uvicorn dependency to mitigate AttributeError: 'Config' object has no attribute 'setup_event_loop' error. (#7287)

If you see the error above, upgrade SkyPilot with:

pip install "skypilot==0.10.3.post1"
# Restart your local API server
sky api stop; sky api start

If you are using a remote API server, it should still work with v0.10.3.

SkyPilot v0.10.3

10 Sep 02:16
1e23b5b

SkyPilot v0.10.3: 2-10x Performance Improvement, Better Observability, Production-Ready Authentication and More

SkyPilot v0.10.3 delivers a 2-10x performance improvement, enhanced cloud integration, improved Kubernetes integration, and strengthened production reliability features for AI/ML workloads across clouds.

Highlights

Massive Performance Improvements: 2-10x more efficient

Upgrade your SkyPilot API server to 0.10.3 and get the improvement out of the box!

Observability & Monitoring

More SkyPilot API server and cluster metrics can now be set up for better observability.

CoreWeave Integration: Autoscaler

CoreWeave autoscaler is now supported, with a single config field change (#6895):

kubernetes:
  autoscaler: coreweave

SkyPilot API Server: Authentication & Production Environment

SkyPilot now supports Microsoft Entra ID SSO login (#7045, #7028), in addition to Okta and Google Workspace. See the docs for setting up SSO login for the API server: https://docs.skypilot.co/en/latest/reference/auth.html#sso-recommended

What's New

UX Improvement

  • Fix CLI auto-completion authentication issue (#6724)
  • Fixed cluster ownership display (#6989)
  • Fixed API logs for daemon request (#6841)
  • SDK improvements for tail_logs (#6902)
  • Responsive spinner for request blocking (#6905)
  • Fixed SDK authentication for download endpoints (#6955)
  • Enable log downloading in dashboard (#6999, #7000, #7003, #7010)
  • Prevented clients from setting DB strings server-side (#7042)
  • Improved logout error handling (#6874)

Storage & File Operations

  • Fixed storage mounting on ARM64 instances (#7008)
  • Improved chunk upload reliability (#6854)
  • Rsync permission fixes for Kubernetes (#6951)

Enhanced Cloud Platform Support

Kubernetes Improvements

  • Pod configuration change detection with warnings (#6912)
  • Volume name validation (#6863)
  • Worker service cleanup on termination (#7014)
  • AMD GPU support improvements

Nebius Cloud

  • Static IP configuration support (#7002)
  • Docker image ID fixes (#6894)
  • Pagination support for list_instances (#6867)
  • Customizable API domain (#6888)
  • Better network error messages (#6971)
  • Various stability improvements (#6856, #6896)

AWS

  • Elastic Fabric Adapter (EFA) automatic enablement for improved network performance (#6852)
  • AWS Systems Manager (SSM) support with new use_ssm flag for secure connections (#6830)
  • Improved VPC error messages and handling (#6746)
  • Fixed CloudWatch region setting issues (#6747)
  • Better teardown leak prevention (#7022)

RunPod

  • Volume support for RunPod network volumes (#6949)

GPU Support

  • Added B200 to common accelerators in sky show-gpus (#7006)

Documentation & Examples

  • New training examples: TorchTitan (#6677), distributed RL with game servers (#6988)
  • Best practices for network and storage benchmarking (#6632)
  • Clarified Ray runtime usage (#6783)
  • SSM documentation improvements (#7050)
  • Fixed broken documentation links (#7026, #7004)

Testing & CI/CD & Development

  • Backward compatibility tests against stable releases (#6979)
  • SSH lag unit tests (#6968)
  • Improved smoke test reliability (#6913, #6958, #6967, #7019)
  • Customizable Buildkite queue support (#7063)
  • Fixed type errors in managed jobs (#6994)

Developer Experience

  • Python 3.12 and 3.13 support (#5304, #6990)
  • Performance measurement annotations (#6943)
  • Better error handling for UV-only environments (#6893)

API Server Improvements

  • Request retention daemon fixes (#7035)
  • Better error messages for unavailable server logs (#6855)
  • Request ID header corrections (#7025)
  • Backward compatibility improvements ([#6981](#6...

SkyPilot v0.10.2

27 Aug 08:04
fc35b36

SkyPilot v0.10.2: Enhanced observability, programmatic SDK, performance, and more

SkyPilot v0.10.2 brings improvements in cluster management, better Kubernetes support, a programmatic SDK, and numerous stability and performance enhancements for production use by teams running a large number of workloads.

Get it now with:

uv pip install "skypilot>=0.10.2"

Highlights

Cluster observability improvements

Get provisioning logs with the new --provision option for sky logs (#6638):

sky logs --provision <cluster-name>

Find the detailed reason for a cluster failure, e.g., OOM, with the cluster events (#6590, #6593, #6615, #6620, #6621, #6667, #6658, #6609, #6617).

SDK: Process your logs while streaming

SkyPilot introduces a new preload_content option for tail_logs to enable processing logs while streaming.

import sky

# cluster_name and job_id refer to an existing cluster and one of its jobs.
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()

Performance Improvements

  • Job dashboard page loading speed improvement for 10k+ jobs (#6714, #6652)
  • Volume mounting on Kubernetes reduced from 2 minutes to seconds (#6679)
  • Increased the connection limit for a single client from 10 to 100 (#6782)
  • Speed up sky down for AWS clusters with exposed ports by 4x (#6629, #6663, #6720)
  • Fix the OOM issue with large file_mounts (#6865)
  • Enhanced SSH connection robustness (#6715)

What's New

Infrastructure-specific Improvements

  • Kubernetes:
    • Label support for volumes (#6696)
    • Better authentication handling for in-cluster services (#6600)
    • Proxy setup guide for Kubernetes (#6618)
    • Improved error handling: Enhanced failover error messages (#6613)
    • Enable write access to conda base environment (#6766)
    • Better GPU label detection with empty labels (#6859)
  • AWS:
    • Customizable SSH users (#6625)
    • Root device name detection (#6644)
    • Support Rocky Linux (#6711)
    • Fix fractional GPU quantity for P5.48xlarge (#6722)
  • Nebius:
    • Memory configuration support (#6832): Users can now properly specify memory requirements (e.g., 64GB instances)
    • More robust cluster status fetching (#6693, #6672)
    • Enhanced VRAM calculation (#6628)
    • Fixes for infinite waiting (#6674)
  • RunPod:
    • Docker image handling with any_of config (#6728)
    • Docker name conflict resolution (#6769)
  • Lambda: Support B200 on Lambda (#6816)
  • R2 mount issue resolution (#6662)

Dashboard Enhancements

  • Managed Jobs:
  • User Management:
    • Allow editing allowed_users and private settings for workspaces (#6566)
  • Misc:
    • Fixed tour auto-start behavior (#6654)

API Server Enhancements

  • Robust jobs/serve controller against upgrades (#6779)
  • Prometheus metrics in deployment mode (#6712)
  • Docs for using Cloud SQL for states (#6587)
  • State management improvements:
    • Unique constraint violation detection for PostgreSQL (#6773)
    • SQLAlchemy warning suppression (#6796)
    • Reduce DB creation operations (#6594)
  • Admin Policy & Permissions & RBAC:
    • Apply admin policy for volumes (#6668, #6781, #6749)
    • Enhanced restful admin policy in testing (#6353, #6626)
    • Backward compatibility handling for admin policies (#6692)
    • Optimized permission module logging (#6805)

UX/API Improvements

  • CLI Enhancements:
    • Better managed job log tailing behavior (#6719)
    • Improved --config option documentation (#6794)
    • Better logs organization for downloading (#6795)
    • Warning for disk size specifications on Kubernetes (#6637, #6684)
    • Client version and commit info in sky api info (#6748)
    • Improved cluster event messaging and wording (#6686, #6688)
    • Better user feedback with spinner for delay messages (#6575)
    • Enhanced pod configuration validation (#6825)
  • SDK Improvements:
    • Improved transient failure handling (#6808, #6807): More intelligent retry mechanisms
    • Simpler log streaming utility (#6750): More intuitive API for processing logs in real-time with iterator-based approach
    • Better response typing (#6659, #6718, #6527)
    • Exposed sky.endpoints function (#6599)
    • Enhanced decoder backward compatibility (#6810)
    • Comprehensive endpoints documentation (#6815)
  • Robustness and Performance Improvements:
    • Fixed authentication with simplejson (#6698)
    • Reduced import statement overhead (#6641, #6645, #6648)
    • Optimized directory utilities (#6646)
    • Resolved API logging issues (#6619)
    • Enhanced retry logic for non-transient errors (#6844)
    • Better handling of missing pods with termination filtering (#6697)
    • Fixed various cluster name querying issues (#6616, #6624)
    • Improved job import handling in sky init (#6623)

Testing and Infrastructure

  • Test Improvements:
    • Automatic retry for core tests (#6710)
    • Expanded nightly build coverage (#6695)
    • Staggered test execution (#6732)
    • Remote server test support (#6819, #6823)
    • New smoke tests for provisioning logs (#6790, #6818)
    • Fixed endpoint comparison in unittests (#6578)
    • Fixed TPU test example failure (#6673)
    • Added Kubernetes volume merging unit tests (#6813)
    • Added support for custom cloud config files in smoke tests (#6851)
    • Enhanced serve status checks after termination (#6713)
    • Include Nebius VMs tests in CI/CD (#6835)
  • CI/CD Enhancements:
    • PAT token for release publishing (#6612)
    • Helm lint in pre-commit hooks (#6798, #6803)
    • Fixed nightly build triggers (#6610)
  • Helm Charts & Deployment:
    • Fixed RBAC template rendering issues (#6737, #6742)
    • Resolved ingress configuration for oauth in grafana (#6799)
    • Enhanced values.yaml documentation (#6800)
    • Better patch strategy for merging Kubernetes configs (#6653)

Other feature improvements

  • SSH Node Pool: region-specific launches (#6767)
  • SkyServe: GPU-aware Load Balancing (#6147)
    SkyServe now intelligently scales based on GPU types. Set different QPS targets across heterogeneous GPUs:
    load_balancing_policy: instance_aware_least_load
    replica_policy:
      target_qps_per_replica:
        "H100:1": 2.5    # H100 can handle 2.5 QPS
        "A100:1": 1.25   # A100 can handle 1.25 QPS
        "A10:1": 0.5     # A10 can handle 0.5 QPS

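One plausible reading of the instance_aware_least_load policy above is that each request goes to the replica whose current load is smallest relative to its per-GPU QPS target. The Python sketch below illustrates that idea only. It is not SkyServe's implementation, and `Replica`, `pick_replica`, and the `in_flight` counter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

# Per-accelerator QPS targets, copied from the YAML above.
TARGET_QPS = {"H100:1": 2.5, "A100:1": 1.25, "A10:1": 0.5}


@dataclass
class Replica:
    name: str
    accelerator: str
    in_flight: int = 0  # requests currently being served (hypothetical metric)


def pick_replica(replicas: List[Replica]) -> Replica:
    """Return the replica with the lowest load normalized by its target QPS."""
    return min(replicas, key=lambda r: r.in_flight / TARGET_QPS[r.accelerator])


replicas = [
    Replica("r-h100", "H100:1", in_flight=4),  # 4 / 2.5  = 1.6
    Replica("r-a100", "A100:1", in_flight=1),  # 1 / 1.25 = 0.8
    Replica("r-a10", "A10:1", in_flight=1),    # 1 / 0.5  = 2.0
]
chosen = pick_replica(replicas)
print(chosen.name)  # r-a100
```

Normalizing by the target means a faster GPU can hold more in-flight requests before it is considered busier than a slower one.
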
Backend

Examples and Documentation

Contributors

We thank all contributors who made this release possible!

New Contributors: @miltava, @tomzx, @webconn, @hongsu-moreh, @ibpark-moreh, @nathan-liner

All Contributors: @DanielZhangQD, @cblmemo, @kyuds, @SeungjinYang, @lloyd-brown, @romilbhardwaj, @rohansonecha, @zpoint, @aylei, @SalikovAlex, @Michaelvll, @kevinmingtarja, @miltava, @Maknee, @cg505, @tomzx, @concretevitamin, @webconn, @sethkimmel3, @andylizf, @hongsu-moreh, @ibpark-moreh, @lucamanolache, @nathan-liner

Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!

Full Changelog

For a complete list of changes, see the commit history.

SkyPilot v0.10.1

11 Aug 05:50
0c04d75

SkyPilot v0.10.1: Improved production readiness, distributed training, cloud integrations, and more

SkyPilot v0.10.1 improves enterprise production readiness with large-scale distributed training capabilities (Llama 4 400B, OpenAI GPT-OSS), introduces/enhances integration with AMD GPUs and leading GPU clouds (CoreWeave, Nebius), and delivers enhanced reliability features for mission-critical AI workloads.

Get it now with:

uv pip install "skypilot>=0.10.1"

Highlights

Native Git Support

Use your (private) git repositories as your SkyPilot workdir:

# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab # commit hash or branch name

Find the commit hash of your workdir in the dashboard.

Added in (#6294, #6257, #6268).

Distributed LLM Pretraining/Finetuning and Agentic training Examples

We released high-performance distributed training examples for large models with checkpointing support (#6525, #6551, #6242, #6273).

  • Llama 4 Maverick 400B model training on more than 16 H200 GPUs: example
  • OpenAI GPT-OSS model pretraining and finetuning: example

Agentic training example with VeRL is now also available (#6443).

Cloud and accelerator support

Native AMD GPU Support on Kubernetes

AMD ROCm is now supported on Kubernetes clusters with SkyPilot!

# task.sky.yaml
resources:
  infra: k8s/my-amd-cluster
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300:4

CoreWeave Integration

SkyPilot now supports CoreWeave clusters with native Infiniband and object store support (#6386, #6487, #6483).

Use your CoreWeave cluster with SkyPilot:

resources:
  infra: k8s/my-coreweave-cluster
  network: best # Enable infiniband

Nebius cloud improvement

B200 GPUs and spot instances are now supported on Nebius cloud (#6474, #6478, #6267).

# task.sky.yaml
resources:
  infra: nebius
  accelerators: B200:8
  use_spot: true

Additionally, SkyPilot now supports MOUNT_CACHED mode for Nebius cloud. (#6456)

Autostop/Autodown based on SSH sessions

In addition to waiting for running jobs, autostop/autodown can now be configured in the SkyPilot YAML to also wait for active SSH sessions, or to wait for neither (#6361, #6485).

# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh
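
The effect of wait_for can be pictured as a small idleness check. The Python sketch below is an illustration under stated assumptions, not SkyPilot's code: `is_idle` is a hypothetical helper, and the values `jobs` and `none` are assumed names for the other modes described above (only `jobs_and_ssh` appears in the example).

```python
def is_idle(wait_for: str, running_jobs: int, ssh_sessions: int) -> bool:
    """Decide whether a cluster counts as idle for autostop purposes."""
    if wait_for == "jobs_and_ssh":
        # Idle only when no jobs are running and nobody is connected over SSH.
        return running_jobs == 0 and ssh_sessions == 0
    if wait_for == "jobs":  # assumed value: ignore SSH sessions
        return running_jobs == 0
    if wait_for == "none":  # assumed value: ignore both jobs and SSH sessions
        return True
    raise ValueError(f"unknown wait_for value: {wait_for!r}")


# An open SSH session keeps the cluster alive under jobs_and_ssh.
print(is_idle("jobs_and_ssh", running_jobs=0, ssh_sessions=1))  # False
```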

External Logging

You can now dump your SkyPilot cluster/job logs to external logging services like AWS CloudWatch and GCP Cloud Logging (#6331, #6405, #6411, #6369). Configure it in your ~/.sky/config.yaml:

logs:
  store: aws  # Or 'gcp', etc.
  aws:
    ...  # Service-specific options; see below.

What's New

UX/API

  • 2x Faster sky status: Intelligent caching reduces status command latency by 50% (#6166)
  • Optimized Imports: Lazy loading of dependencies for faster CLI startup (#6302)
  • Autostop/Autodown: Improve autostop logic during cluster refresh (#6388)
  • Fast network timeout detection: Faster timeout detection for remote API server checks (#6263)
  • Better hints for parallel launch (#6549)
  • Support sky api logout to log out from the API server (#6284, #6327, #6412)
  • Type Checking: Enhanced mypy support with py.typed marker (#6440)

UI

  • Interactive Tour: Guided tour for new users to learn dashboard features (#6565, #6301)
  • Mobile Support: Responsive design for narrow screens (#6280)
  • YAML Syntax Highlighting: Enhanced code readability with syntax highlighting (#6404)
  • User Management: Delete users directly from the dashboard interface (#6434)
  • Display fixes: Cluster history tracking issues (#6512), VSCode remote connection (#6555), workspace resources aggregation (#6537, #6283)

Storage/Volume

  • S3 Mounting on ARM-based Instances: Full support for S3 mounting on ARM64 instances (#6255)
  • Volume Management SDK: Added an SDK for creating and managing persistent volumes across clouds (#6439, #6170, #6383)
  • New examples for volumes: juiceFS, NFS, etc. (#6521)

API Server

API Server Deployment

  • Helm Chart on Artifact Hub: SkyPilot Helm chart now available on Artifact Hub (#6560, #6371)
  • More robust DB: Use Alembic for database schema migrations (#6579, #6556, #6196), thread safety improvements (#6422, #6279)
  • OAuth2 Proxy Integration: Built-in OAuth2 proxy support (#6476)
  • Redis Secret Support: Configuration for Redis URL via secrets (#6559)
  • Deployment image size reduction (#6312, #6285)
  • GPU Detection: Fixed GPU name matching with underscores (#6491)

Client Enhancements

  • Enhanced Client Reliability: Automatic retry on transient failures (#6259, #6298, #6234)
  • Robustness enhancements: lock timeout notifications (#6530), transient error retry (#6234), catalog refresh (#6272)
  • Service Account Token Auth Fix: Fixed token authentication in file uploads (#6432)
  • Issue fixes: file path escaping (#6262)

Managed Jobs

  • Force Disable Cloud Buckets: Configuration option to disable cloud bucket usage for managed jobs and SkyServe (#6402, #6407)
  • Multi-Task Log Viewing: See logs from all tasks in a managed job pipeline (#6415)

Cloud Integrations

  • GCP N4 Instance Types: Support for new N4 machine family (#6253)
  • GCP TPU Improvements: Enhanced TPU provisioning logic (#6420)
  • Hyperbolic: Marked CUSTOM_MULTI_NETWORK as unsupported (#6245)
  • RunPod: CPU instance launching support (#6450)
  • Vast: Port opening support (#6282)
  • Azure: Dependency resolution fixes (#6231)

Misc

  • Developer Experience: Improved format.sh for code consistency (#6580)
  • Backward Compatibility: Fixed issues with Task field and version detection (#6548, #6564, #6485)
  • Test Stability: Fixed various flaky tests (#6507, #6495, #6511, #6469)
  • Release Pipeline: Fixed API version extraction and workflow issues (#6464, #6254)
  • Buildkite: Sequential test execution to reduce resource usage (#6217)

Production Deployment Guides

  • PostgreSQL with GCP Cloud SQL: Complete guide for deploying API server with Cloud SQL backend (#6325)
  • Docker Deployment: Containerized API server deployment guide (#6296)
  • AWS SSM as SSH proxy: Setup guide for AWS Systems Manager (#6522)

Contributors

We thank all contributors who made this release possible!

New Contributors: @LokmaSpeedy, @jacobergzhou, @amd-pratmish, @tedspare, @jimbz, @makhalin

All Contributors: @alex000kim, @amd-pratmish, @andylizf, @aylei, @bikramnehra, @cblmemo, @cg505, @clayrosenthal, @concretevitamin, @DanielZhangQD, @jacobergzhou, @jimbz, @kevinmingtarja, @kyuds, @lloyd-brown, @LokmaSpeedy, @lucamanolache, @makhalin, @Maknee, @Michaelvll, @rohansonecha, @romilbhardwaj, @SalikovAlex, @SeungjinYang, @tedspare, @zhenjiasun, @zpoint

Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!

Full Changelog

For a complete list of changes, see the commit history.

SkyPilot v0.10.0

22 Jul 06:27
61fc313

SkyPilot v0.10.0: Enterprise-readiness with SSO, dashboard, external PostgreSQL, workspaces and more

We are excited to announce SkyPilot 0.10! This release is the largest release by far, bringing enterprise-ready features including API server deployment in production with SSO, feature-rich dashboard, external PostgreSQL, workspace isolation and graceful upgrade, automatic network setup, and SSH Node Pools.

Get it now:

pip install -U skypilot

Highlights

API server deployment in production

Single Sign-On (SSO) and service account (Okta, Google Workspace, etc.)

SkyPilot now integrates with enterprise SSO providers such as Okta and Google Workspace, enabling secure authentication with automatic account creation and access control.

Log in to the API server with SSO enabled:

$ sky api login -e https://skypilot.example.com
A web browser has been opened to http://skypilot.example.com/token. Please continue the login in the web browser.
To manually copy the token, press ctrl+c.

Logged into SkyPilot API server at: http://skypilot.example.com
└── Dashboard: http://skypilot.example.com/dashboard

Users authenticate via their organization's SSO provider, and their identities are automatically tracked across all SkyPilot resources.

Feature-rich Dashboard

The SkyPilot dashboard now includes a significant number of new features:

  • See all infrastructure available in one page
  • Edit your SkyPilot config in dashboard
  • See and manage all your users in an organization
  • Find your GPU metrics in dashboard
  • Check your YAML/entrypoint/git commit hash for jobs
  • Find more features in Dashboard section below.

External PostgreSQL for API server in production

SkyPilot 0.10 adds support for persisting API server state to an external PostgreSQL database, enabling high availability and disaster recovery for production deployments.

Configure your deployment to use a managed database service (e.g., AWS RDS, Cloud SQL) to ensure your cluster and job state survive API server restarts or migrations.

db: postgresql://myusername:mypassword@hostname:5432/database
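
Before deploying, it can help to sanity-check the connection string. The following Python sketch uses only the standard library. `parse_db_url` is a hypothetical helper, not part of SkyPilot, and the default port of 5432 is the usual PostgreSQL default rather than something the release notes specify.

```python
from urllib.parse import urlsplit


def parse_db_url(url: str) -> dict:
    """Split a postgresql:// URL into its parts and reject obviously bad input."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError(f"expected a postgresql:// URL, got scheme {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("URL must include a host and a database name")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to the usual PostgreSQL port
        "database": database,
    }


info = parse_db_url("postgresql://myusername:mypassword@hostname:5432/database")
print(info["host"], info["port"], info["database"])  # hostname 5432 database
```

The password is deliberately left out of the returned dictionary so it does not end up in logs.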

Workspaces: isolation and declarative configuration for teams

Workspaces provide a declarative way to define isolated environments with custom cloud configurations for different teams or projects.

Configure workspaces to control which teams can access which infrastructure:

# API server config
workspaces:
  research-private:
    private: true
    allowed_users:
      - [email protected]
      - [email protected]
    gcp:
      project_id: skypilot-research-private
    aws:
      disabled: true
  ml-team:
    gcp:
      project_id: skypilot-ml-team-prod

Teams simply set their active workspace to use their workspace configuration:

# In team's .sky.yaml
active_workspace: ml-team
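
The access rule implied by private and allowed_users can be sketched in a few lines of Python. This is an illustration only: `can_access` is a hypothetical helper, the email addresses are placeholders, and the sketch assumes that a workspace without private: true is open to every user.

```python
# Mirrors the shape of the workspace config above, with placeholder users.
WORKSPACES = {
    "research-private": {
        "private": True,
        "allowed_users": ["alice@example.com", "bob@example.com"],
    },
    "ml-team": {},
}


def can_access(workspaces: dict, workspace: str, user: str) -> bool:
    """Return True if `user` may use `workspace` under the assumed rules."""
    if workspace not in workspaces:
        return False
    config = workspaces[workspace]
    if not config.get("private", False):
        return True  # assumption: non-private workspaces are open to everyone
    return user in config.get("allowed_users", [])


print(can_access(WORKSPACES, "research-private", "carol@example.com"))  # False
print(can_access(WORKSPACES, "ml-team", "carol@example.com"))           # True
```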

Graceful upgrade of API server

SkyPilot 0.10 introduces robust graceful upgrade of API server:

  • Clients automatically wait for an upgrade to finish and retry
  • Future compatibility across minor/major versions

Automatic high performance network setup (network_tier: best)

SkyPilot v0.10.0 can now automatically configure a high-performance network with a single network_tier: best setting. Supported infra:

  • Nebius VMs
  • Nebius managed Kubernetes service
  • GCP VMs
  • Google Kubernetes Engine (GPUDirect-TCPX, GPUDirect-TCPXO, and GPUDirect-RDMA for H100 and H200)

SSH Node Pools: bring your own machines

Turn your existing machines — on-premises servers, cloud reserved instances or even your personal workstation — into SSH Node Pools to run SkyPilot clusters and jobs on them.

Configure your machines in ~/.sky/ssh_node_pools.yaml:

# ~/.sky/ssh_node_pools.yaml
my-node-pool:
  hosts:
    - 1.2.3.4
    - 1.2.3.5

Deploy SkyPilot on them with a single command:

$ sky ssh up
$ sky launch --infra ssh/my-node-pool -- python train.py

Your machines now appear as infra choices alongside cloud providers, complete with GPU availability tracking and resource management.

New clouds

  • Hyperbolic cloud integration for cost-effective AI workloads (#5517)
  • Samsung Cloud Platform (SCP) support for new provisioner interface in SkyPilot (#5587)

What's new

CLI & Core interfaces

  • New --infra option to specify infrastructure instead of separate --cloud/--region/--zone flags (#5602, #5656, #5695, #5703)
    • Supports cloud providers: --infra aws/us-west-2/us-west-2a, --infra aws/*/us-west-2b
    • Kubernetes contexts: --infra k8s/my-k8s-context
    • SSH node pools: --infra ssh/my-ssh-pool
  • Secrets: define secrets in your SkyPilot YAML for secure environment variable injection (#6015, #6025)
  • GPU selection by memory size: --gpus 80GB+ (#5948)
  • Support units for disk, memory, and autostop specification (#5952, #6026)
  • Improved error handling and stacktrace display with SKYPILOT_DEBUG=1 (#6121)
  • sky cancel now supports glob patterns for cluster names (#5933)
  • Centralized log collection for tasks (#5992)
  • CLI fixes and improvements (#5915, #5811, #6213, #5798, #5871, #5729, #5880, #5590, #6228, #6030)
  • Core stability improvements (#6019, #5985, #5698, #5699, #5754, #5768, #5776, #5787, #5786, #5833, #5838, #5863, #5882, #6088, #6113)
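
The --infra strings listed above share a slash-separated shape. The Python sketch below shows one way to split them; it is a hypothetical parser written for illustration, not SkyPilot's own. `Infra` and `parse_infra` are made-up names, and treating `*` as "unspecified" is an assumption based on the `aws/*/us-west-2b` example.

```python
from typing import NamedTuple, Optional


class Infra(NamedTuple):
    cloud: Optional[str]
    # For k8s/... and ssh/... specs, this slot holds the context or pool name.
    region: Optional[str] = None
    zone: Optional[str] = None


def parse_infra(spec: str) -> Infra:
    """Split 'cloud[/region[/zone]]' into fields; '*' means unspecified."""
    parts = spec.strip("/").split("/")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"invalid infra spec: {spec!r}")
    fields = [None if part == "*" else part for part in parts]
    fields += [None] * (3 - len(fields))  # pad missing region/zone
    return Infra(*fields)


print(parse_infra("aws/us-west-2/us-west-2a"))
print(parse_infra("aws/*/us-west-2b"))
print(parse_infra("k8s/my-k8s-context"))
```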

Authentication & Security

Dashboard

API Server

  • Workspace support for team isolation (#5660, #571...

v0.9.3

20 May 08:19
c200ac1

What's Changed

SkyPilot v0.9.2

30 Apr 02:07
8c68500

Choose a tag to compare

This patch release is a minor bump from v0.9.1 to resolve an issue that could affect users with old clusters from v0.7 and earlier (#5439).

See the full v0.9 release notes for everything new in SkyPilot v0.9!

SkyPilot v0.9.1

24 Apr 18:04
1ffe585

SkyPilot v0.9.1: API Server Architecture, Web Dashboard, Faster Storage, Improved Configuration and more!

We're excited to announce the release of SkyPilot v0.9.1! This update brings major improvements to SkyPilot, making it faster, more powerful and flexible for production-ready deployment.

Highlights

Client-Server Architecture

The new client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.

  • Unified view and management: Get a single view of all running clusters and jobs across the organization and all infra you have.
  • Integrate with workflow orchestrators: SkyPilot state is centralized on the API server and does not need to be maintained in orchestrators like Airflow.
  • Multi-tenancy: Share clusters, jobs, and services securely among teammates.

More: Docs, Blog

Web dashboard

SkyPilot has a new dashboard! Easily view and manage your clusters, jobs and logs.

Access it with sky dashboard.

New configuration system

SkyPilot now supports specifying configuration at various levels: CLI, SkyPilot YAML, project-level config, client-level global config and server-side config.

You can now have a project configuration storing default values for all jobs in a project, a user configuration to apply globally to all projects and Task YAML overrides for specific jobs.

New mount_cached storage - 9.6x faster checkpointing

The new storage mode mount_cached uses the local disk as a cache for cloud storage buckets. It boosts GPU utilization by making cloud I/O asynchronous.

file_mounts:
  /checkpoints:
    source: gs://my-checkpoints-bucket
    mode: MOUNT_CACHED  # Will asynchronously upload all writes to the bucket
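
The idea behind a cached mount, writing to local disk first and uploading in the background, can be sketched with the Python standard library. This is a toy model and not how SkyPilot implements MOUNT_CACHED: `WriteBehindCache` is a hypothetical class, and a second local directory stands in for the bucket.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class WriteBehindCache:
    """Writes land on local disk; a background thread copies them to 'remote'."""

    def __init__(self, cache_dir: Path, remote_dir: Path):
        self.cache_dir = cache_dir
        self.remote_dir = remote_dir
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._upload_loop, daemon=True)
        self._worker.start()

    def write(self, name: str, data: bytes) -> None:
        path = self.cache_dir / name
        path.write_bytes(data)   # fast local write; the caller returns right away
        self._pending.put(path)  # the upload happens later, off the critical path

    def _upload_loop(self) -> None:
        while True:
            path = self._pending.get()
            try:
                if path is None:  # sentinel: stop the worker
                    return
                shutil.copy2(path, self.remote_dir / path.name)
            finally:
                self._pending.task_done()

    def close(self) -> None:
        self._pending.join()     # wait until every queued upload has finished
        self._pending.put(None)
        self._worker.join()


with tempfile.TemporaryDirectory() as cache, tempfile.TemporaryDirectory() as remote:
    store = WriteBehindCache(Path(cache), Path(remote))
    store.write("step-100.ckpt", b"weights")
    store.close()
    uploaded = sorted(p.name for p in Path(remote).iterdir())
print(uploaded)  # ['step-100.ckpt']
```

The training loop only pays for the local write, which is why this style of caching helps checkpoint-heavy workloads.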

More: Docs, Blog

New cloud: Nebius

SkyPilot now supports Nebius cloud! Getting started is easy:

$ sky check nebius
$ sky launch --gpus H200:8 --cloud nebius

ARM instance support - run SkyPilot on GH200s, GB200s, and more!

New native images for ARM instances allow you to run SkyPilot on your GH200s and GB200s on Lambda cloud, GCP, or your own Kubernetes clusters! (#4835)

What's new

CLI & Core interfaces

  • sky CLI now returns non-zero exit code on launch/exec/logs/jobs launch/jobs logs failures (#4846)
    • This improves scriptability with sky CLI in automated workflows.
  • sky check now separately checks storage and compute capabilities (#4996, #4977)
  • New --all option for sky jobs queue to show all jobs (#4923)
  • resources.gpus can now be used to alias resources.accelerators in the SkyPilot YAML (#5207)

Managed Jobs

  • Multiple users can now share the same jobs controller (#4733)
  • Autostop and autodown settings for the jobs controller can now be customized (#5182)
    # ~/.sky/config.yaml
    jobs:
      controller:
        # autostop: false  # to disable completely
        autostop:
          idle_minutes: 5
          down: true
    
  • See other users' jobs with sky jobs queue -u when using a shared controller (#4787)
  • Access to cloud object storage is no longer necessary for using file mounts or workdir in managed jobs. (#4708)
    • Running managed jobs on Kubernetes no longer requires cloud access.

Storage

  • New mode: mount_cached (#4369)
    • This mode is optimized for checkpointing large models
    • It asynchronously uploads the cached directory to the cloud storage bucket, increasing GPU utilization.
  • Fix issue with openrsync on macOS 15 causing upload failures (#5196)
  • .gitignore handling is now more robust (#4988)
  • Fix exclusion for AWS bucket upload (#5128)

Kubernetes

  • Revamped /dev/fuse access mechanism on k8s (#5028)
    • We no longer need to request smarter-devices-fuse resource, making SkyPilot fuse mounting compatible on autoscaling clusters.
  • B200 GPUs are now supported on GKE (#5102)
  • Scale-to-zero autoscaling is now supported on GKE (#4935)
    • SkyPilot can now inspect the node pools available on scale-to-zero clusters before provisioning.
    • This allows SkyPilot to intelligently filter out clusters that cannot provision the requested GPU type.
  • sky check now detects and hints for unlabeled GPU nodes on GKE (#5065)
  • GPU names are now case-insensitive; numbers-only name formats are now supported (#4756, #4925)
  • Fixed fractional CPU support when using <1 CPU core (#4707)
  • Fix node filtering when provisioning multiple GPUs (#4930)
  • initContainers can now be overridden through pod_config (#5247)
  • Instructions on mounting NFS volumes (#4951)
  • GPU labelling script can now use custom context names (#5072)
  • Fixed a bug where clusters from stale contexts could not be cleaned up (#4980)

Backend

  • New Client-Server Architecture (#4660)
    • This allows SkyPilot to be deployed as a remote service shared by multiple users.
  • Fixed conda support when using python 3.12 (#4035)
  • sky exec now waits for the cluster to be started (#4867)
  • sky local up --ips now supports specifying sudo password (#5030)
  • Clouds with expired credentials are now automatically excluded from failover (#5015)

SkyServe

  • New Spot/On-demand Policy: dynamic_fallback (#4628)
    • New spot_placer field can be set to dynamic_fallback to let SkyPilot automatically switch from spot to on-demand instances if spot instances are not available.
    • More details in paper
  • Fixed: any_of field order issue causing version bump to not work (#4978)
  • Fixed: LiveError on controller (#4995)

Cloud Support

  • New cloud: Nebius (#4573, #4838)
  • GCP
    • TPU v6e is now supported on GKE clusters (#4986)
    • VPCs from different projects can be used (#5143)
    • Newer instance types (e.g., a3-highgpu-8g) can now be directly selected from the CLI with -t flag (#5120)
  • RunPod
    • Custom docker images with non-root user are now supported (#4683)
  • Lambda
    • New regions: us-east-3 and australia-east-1 (#4703, #4738)
    • Ports can now be opened on Lambda VMs (#5124)
  • Fluidstack: NVLINK GPUs are now supported (#3954)
  • IBM: new fetcher for IBM catalog (#5003)
  • Cloudflare R2: fixed upload issues when using new awscli versions (#5282)

New Examples and Tutorials

⚠️ Deprecations and removals

Removed

  • Env vars starting with SKY_ are no longer supported. Use SKYPILOT_ env vars instead.
  • Old services from 0.7.0 (before #4439) may need to be stopped and restarted.
  • kubernetes is no longer a valid region name. Use the Kubernetes context name to specify a Kubernetes cluster if required.

Deprecated

  • experimental.config_overrides has been deprecated. Use the config field instead.

Migration guide

SkyPilot 0.9.1 introduces the asynchronous execution model, which may cause compatibility issues with user programs using SkyPilot SDKs <=0.8.1.

Refer to the migration guide to upgrade your code.

TL;DR: Wrap all SkyPilot SDK function calls (except tail_logs) with sky.stream_and_get() to make your program behave mostly the same as before:

# <= 0.8.1
job_id, handle = sky.launch(task)
# 0.9.1
job_id, handle = sky.stream_and_get(sky.launch(task))

Thanks to all contributors!

New contributors: @kyuds, @BorenTsai, @funkypenguin, @JiangJiaWei1103, @SalikovAlex, @flaviomartins, @ajay, @bradhilton, @SeungjinYang, @eltociear, @vvidovic, @KennBro, @DanielZhangQD

Many thanks to all contributors who contributed to this release!

Contributors: @aylei, @zpoint, @SeungjinYang, @cg505, @michaelvl...

SkyPilot v0.8.1

09 Apr 00:56
1ea3b4d

This patch release is a minor bump over v0.8.0 to get you the latest fixes as soon as possible.

  • Pin wheel<0.46.0 to mitigate build errors when launching clusters in environments with wheel>=0.46.0 (#5153)

Stay tuned for a major upgrade coming up in v0.9.0!