Releases: skypilot-org/skypilot
SkyPilot v0.10.3.post2
This patch release is a minor bump over v0.10.3.post1 that improves robustness on CoreWeave clusters by adding retries and fallbacks for Kubernetes API calls:
- Fall back to file-based command execution on 431 and 400 errors #7536 #7563
- Retry on transient authentication issues with error code 403 #7568 #7574
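The two behaviors above (retry on a transient 403, switch strategy on 400/431) can be sketched as follows. This is a minimal illustration and not SkyPilot's actual code: `ApiError`, `call_with_retry_and_fallback`, and the `primary`/`fallback` callables are hypothetical names introduced here.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical error type standing in for a failed Kubernetes API call."""

    def __init__(self, status: int):
        super().__init__(f"API call failed with status {status}")
        self.status = status


TRANSIENT_STATUSES = {403}      # assumed transient auth failures: retry
FALLBACK_STATUSES = {400, 431}  # request rejected as sent: switch strategy


def call_with_retry_and_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    max_retries: int = 3,
    backoff_seconds: float = 0.5,
) -> str:
    """Run `primary`; retry on 403, use `fallback` on 400/431, else re-raise."""
    attempt = 0
    while True:
        try:
            return primary()
        except ApiError as e:
            if e.status in FALLBACK_STATUSES:
                # Retrying the same request would fail again, so change approach
                # (in the release notes: file-based command execution).
                return fallback()
            if e.status in TRANSIENT_STATUSES and attempt < max_retries:
                attempt += 1
                time.sleep(backoff_seconds * attempt)  # linear backoff
                continue
            raise
```

The split matters because the two error classes call for different reactions: a transient 403 may succeed on the next attempt, while a 400 or 431 will keep failing until the request itself changes.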
To upgrade:
pip install "skypilot==0.10.3.post2"
# Restart your local API server
sky api stop; sky api start
Or, upgrade the API server:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.3
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--set apiService.image=berkeleyskypilot/skypilot:0.10.3.post2 \
--version $VERSION --devel --reuse-values
Full Changelog: v0.10.3.post1...v0.10.3.post2
SkyPilot v0.10.3.post1
This patch release is a minor bump over v0.10.3 to fix a dependency issue caused by a breaking change in uvicorn==0.36.0:
- Pin the uvicorn dependency to mitigate the AttributeError: 'Config' object has no attribute 'setup_event_loop' error. (#7287)
If you see an error like the one above, upgrade SkyPilot with:
pip install "skypilot==0.10.3.post1"
# Restart your local API server
sky api stop; sky api start
If you are using a remote API server, it should still work with v0.10.3.
SkyPilot v0.10.3
SkyPilot v0.10.3: 2-10x Performance Improvement, Better Observability, Production-Ready Authentication and More
SkyPilot v0.10.3 delivers 2-10x performance improvement, enhanced cloud integration, improved Kubernetes integration, and strengthened production reliability features for AI/ML workloads across clouds.
Highlights
Massive Performance Improvements: 2-10x more efficient
- 4x faster sky status (#6883, #6858, #6868, #6871, #6882, #6940, #6948, #6892, #6908)
- 2-10x speedup for dashboard pages, including users, infra, volumes (#7031, #7033, #6959, #7013, #7011, #7012)
- 2x speedup for remote PostgreSQL environments through connection pooling (#6998, #6897)
- Fixed request responsiveness and SSH lagging/disconnection issues in production (#6984, #6983, #6889, #6963, #6947, #7015, #6966, #6938, #6957)
- Improved API server memory management (#7046, #7037, #6909)
Upgrade your SkyPilot API server to 0.10.3 and get the improvement out of the box!
Observability & Monitoring
More SkyPilot API server and cluster metrics can now be set up for better observability, including:
- Detailed cluster events (#6907, #6906, #6901, #6899, #6900, #6936, #6875)
- API server detailed metrics for request latency, event loop, and process resource usage (#7017, #6935, #6942, #6968, #7062)

- Request ID tracking for better debugging (#6933)
CoreWeave Integration: Autoscaler
CoreWeave autoscaler is now supported, with a single config field change (#6895):
kubernetes:
  autoscaler: coreweave
SkyPilot API Server: Authentication & Production Environment
SkyPilot now supports Microsoft Entra ID SSO login (#7045, #7028), in addition to Okta and Google Workspace. For details on setting up SSO login for the API server, see: https://docs.skypilot.co/en/latest/reference/auth.html#sso-recommended

What's New
UX Improvement
- Fix CLI auto-completion authentication issue (#6724)
- Fixed cluster ownership display (#6989)
- Fixed API logs for daemon request (#6841)
- SDK improvements for tail_logs (#6902)
- Responsive spinner for request blocking (#6905)
- Fixed SDK authentication for download endpoints (#6955)
- Enable log downloading in dashboard (#6999, #7000, #7003, #7010)
- Prevented clients from setting DB strings server-side (#7042)
- Improved logout error handling (#6874)
Storage & File Operations
- Fixed storage mounting on ARM64 instances (#7008)
- Improved chunk upload reliability (#6854)
- Rsync permission fixes for Kubernetes (#6951)
Enhanced Cloud Platform Support
Kubernetes Improvements
- Pod configuration change detection with warnings (#6912)
- Volume name validation (#6863)
- Worker service cleanup on termination (#7014)
- AMD GPU support improvements
Nebius Cloud
- Static IP configuration support (#7002)
- Docker image ID fixes (#6894)
- Pagination support for list_instances (#6867)
- Customizable API domain (#6888)
- Better network error messages (#6971)
- Various stability improvements (#6856, #6896)
AWS
- Elastic Fabric Adapter (EFA) automatic enablement for improved network performance (#6852)
- AWS Systems Manager (SSM) support with new use_ssm flag for secure connections (#6830)
- Improved VPC error messages and handling (#6746)
- Fixed CloudWatch region setting issues (#6747)
- Better teardown leak prevention (#7022)
RunPod
- Volume support for RunPod network volumes (#6949)
GPU Support
- Added B200 to common accelerators in sky show-gpus (#7006)
Documentation & Examples
- New training examples: TorchTitan (#6677), distributed RL with game servers (#6988)
- Best practices for network and storage benchmarking (#6632)
- Clarified Ray runtime usage (#6783)
- SSM documentation improvements (#7050)
- Fixed broken documentation links (#7026, #7004)
Testing & CI/CD & Development
- Backward compatibility tests against stable releases (#6979)
- SSH lag unit tests (#6968)
- Improved smoke test reliability (#6913, #6958, #6967, #7019)
- Customizable Buildkite queue support (#7063)
- Fixed type errors in managed jobs (#6994)
Developer Experience
- Python 3.12 and 3.13 support (#5304, #6990)
- Performance measurement annotations (#6943)
- Better error handling for UV-only environments (#6893)
API Server Improvements
SkyPilot v0.10.2
SkyPilot v0.10.2: Enhanced observability, programmatic SDK, performance, and more
SkyPilot v0.10.2 brings improvements to cluster management and Kubernetes support, a programmatic SDK, and numerous stability and performance enhancements for production use by teams running a large number of workloads.
Get it now with:
uv pip install "skypilot>=0.10.2"
Highlights
Cluster observability improvements
Get provisioning logs with the new --provision option for sky logs (#6638):
sky logs --provision <cluster-name>
Find the detailed reason for your cluster failure, e.g., OOM, with the cluster events (#6590, #6593, #6615, #6620, #6621, #6667, #6658, #6609, #6617):
SDK: Process your logs while streaming
SkyPilot introduces a new preload_content option for tail_logs to enable processing logs while streaming.
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
Performance Improvements
- Job dashboard page loading speed improvement for 10k+ jobs (#6714, #6652)
- Volume mounting on Kubernetes from 2mins to seconds (#6679)
- Increased the per-client connection limit from 10 to 100 (#6782)
- Speed up sky down for AWS clusters with exposed ports by 4x (#6629, #6663, #6720)
- Fix the OOM issue with large file_mounts (#6865)
- Enhanced SSH connection robustness (#6715)
What's New
Infrastructure-specific Improvements
- Kubernetes:
- Label support for volumes (#6696)
- Better authentication handling for in-cluster services (#6600)
- Proxy setup guide for Kubernetes (#6618)
- Improved error handling: Enhanced failover error messages (#6613)
- Enable write access to conda base environment (#6766)
- Better GPU label detection with empty labels (#6859)
- AWS:
- Nebius:
- RunPod:
- Lambda: Support B200 on Lambda (#6816)
- R2 mount issue resolution (#6662)
Dashboard Enhancements
- Managed Jobs:
- User Management:
- Allow editing allowed_users and private settings for workspaces (#6566)
- Miscs:
- Fixed tour auto-start behavior (#6654)
API Server Enhancements
- Robust jobs/serve controller against upgrades (#6779)
- Prometheus metrics in deployment mode (#6712)
- Docs for using Cloud SQL for states (#6587)
- State management improvements:
- Admin Policy & Permissions & RBAC:
UX/API Improvements
- CLI Enhancements:
- Better managed job log tailing behavior (#6719)
- Improved --config option documentation (#6794)
- Better logs organization for downloading (#6795)
- Warning for disk size specifications on Kubernetes (#6637, #6684)
- Client version and commit info in sky api info (#6748)
- Improved cluster event messaging and wording (#6686, #6688)
- Better user feedback with spinner for delay messages (#6575)
- Enhanced pod configuration validation (#6825)
- SDK Improvements:
- Improved transient failure handling (#6808, #6807): More intelligent retry mechanisms
- Simpler log streaming utility (#6750): More intuitive API for processing logs in real-time with iterator-based approach
- Better response typing (#6659, #6718, #6527)
- Exposed sky.endpoints function (#6599)
- Enhanced decoder backward compatibility (#6810)
- Comprehensive endpoints documentation (#6815)
- Robustness and Performance Improvements:
- Fixed authentication with simplejson (#6698)
- Reduced import statement overhead (#6641, #6645, #6648)
- Optimized directory utilities (#6646)
- Resolved API logging issues (#6619)
- Enhanced retry logic for non-transient errors (#6844)
- Better handling of missing pods with termination filtering (#6697)
- Fixed various cluster name querying issues (#6616, #6624)
- Improved job import handling in sky init (#6623)
Testing and Infrastructure
- Test Improvements:
- Automatic retry for core tests (#6710)
- Expanded nightly build coverage (#6695)
- Staggered test execution (#6732)
- Remote server test support (#6819, #6823)
- New smoke tests for provisioning logs (#6790, #6818)
- Fixed endpoint comparison in unittests (#6578)
- Fixed TPU test example failure (#6673)
- Added Kubernetes volume merging unit tests (#6813)
- Added support for custom cloud config files in smoke tests (#6851)
- Enhanced serve status checks after termination (#6713)
- Include Nebius VMs tests in CI/CD (#6835)
- CI/CD Enhancements:
- Helm Charts & Deployment:
Other feature improvements
- SSH Node Pool: region-specific launches (#6767)
- SkyServe: GPU-aware Load Balancing (#6147)
SkyServe now intelligently scales based on GPU types. Set different QPS targets across heterogeneous GPUs:
load_balancing_policy: instance_aware_least_load
replica_policy:
  target_qps_per_replica:
    "H100:1": 2.5 # H100 can handle 2.5 QPS
    "A100:1": 1.25 # A100 can handle 1.25 QPS
    "A10:1": 0.5 # A10 can handle 0.5 QPS
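One way to picture a policy like instance_aware_least_load is to normalize each replica's current load by the target QPS of its GPU type and route to the replica with the smallest ratio. The sketch below is an assumption about the general idea, not SkyServe's implementation; `Replica`, `in_flight`, and `pick_replica` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Per-replica QPS targets, copied from the config above.
TARGET_QPS: Dict[str, float] = {"H100:1": 2.5, "A100:1": 1.25, "A10:1": 0.5}


@dataclass
class Replica:
    name: str
    accelerator: str  # key into TARGET_QPS
    in_flight: int    # requests currently being served (hypothetical metric)


def pick_replica(replicas: List[Replica]) -> Replica:
    """Return the replica with the lowest load relative to its capacity."""
    return min(replicas, key=lambda r: r.in_flight / TARGET_QPS[r.accelerator])
```

With plain least-load routing, an A10 serving one request would look idler than an H100 serving four. After normalization the H100 (4 / 2.5 = 1.6) is preferred over the A10 (1 / 0.5 = 2.0), which reflects the capacity gap between the two.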
Backend
- Alpha support for replacing skylet with grpc (#6771, #6754, #6788, #6574, #6640, #6643, #6649, #6850)
- Alpha support for Managed Jobs Pool (#6665, #6571, #6614, #6676, #6580, #6682, #6675, #6785, #6591, #6666)
- Enhanced event querying by cluster hash (#6617)
- Resolved unpickle issues with class location changes (#6731)
- Volume resolution before DAG application (#)
- Fixed log cancellation issues (#6801)
Examples and Documentation
- RAG example improvements for embeddings (#6639)
- Kimi K2 multi-node serving example (#6706, #6723)
- Enhanced autostop documentation (#6577)
- Two-hop file transfer guide for managed jobs (#6576)
- SDK examples in Quick Start (#6561, #6597)
- Polished landing page design (#6756)
- Miscs (#6741, #6827, #6828, #6586, #6774)
Contributors
We thank all contributors who made this release possible!
New Contributors: @miltava, @tomzx, @webconn, @hongsu-moreh, @ibpark-moreh, @nathan-liner
All Contributors: @DanielZhangQD, @cblmemo, @kyuds, @SeungjinYang, @lloyd-brown, @romilbhardwaj, @rohansonecha, @zpoint, @aylei, @SalikovAlex, @Michaelvll, @kevinmingtarja, @miltava, @Maknee, @cg505, @tomzx, @concretevitamin, @webconn, @sethkimmel3, @andylizf, @hongsu-moreh, @ibpark-moreh, @lucamanolache, @nathan-liner
Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!
Full Changelog
For a complete list of changes, see the commit history.
SkyPilot v0.10.1
SkyPilot v0.10.1: Improved production readiness, distributed training, cloud integrations, and more
SkyPilot v0.10.1 improves enterprise production readiness with large-scale distributed training capabilities (Llama 4 400B, OpenAI GPT-OSS), introduces/enhances integration with AMD GPUs and leading GPU clouds (CoreWeave, Nebius), and delivers enhanced reliability features for mission-critical AI workloads.
Get it now with:
uv pip install "skypilot>=0.10.1"
Highlights
Native Git Support
Use your (private) git repositories as your SkyPilot workdir:
# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab # commit hash or branch name
Find the commit hash of your workdir in the dashboard:
Added in (#6294, #6257, #6268).
Distributed LLM Pretraining/Finetuning and Agentic training Examples
We released high-performance distributed training examples for large models with checkpointing support (#6525, #6551, #6242, #6273).
- Llama 4 Maverick 400B model training on more than 16 H200 GPUs: example
- OpenAI GPT-OSS model pretraining and finetuning: example
Agentic training example with VeRL is now also available (#6443).
Cloud and accelerator support
Native AMD GPU Support on Kubernetes
AMD ROCm is now supported on Kubernetes clusters with SkyPilot!
# task.sky.yaml
resources:
  infra: k8s/my-amd-cluster
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300:4
CoreWeave Integration
SkyPilot now supports CoreWeave clusters with native Infiniband and object store support (#6386, #6487, #6483).
Use your CoreWeave cluster with SkyPilot:
resources:
  infra: k8s/my-coreweave-cluster
  network_tier: best # Enable InfiniBand
Nebius cloud improvement
B200 GPUs and spot instances are now supported on Nebius cloud (#6474, #6478, #6267).
# task.sky.yaml
resources:
  infra: nebius
  accelerators: B200:8
  use_spot: true
Additionally, SkyPilot now supports MOUNT_CACHED mode for Nebius cloud. (#6456)
Autostop/Autodown based on SSH sessions
In addition to waiting for running jobs, autostop/autodown can now also wait for active SSH sessions, or for neither; set this in the SkyPilot YAML (#6361, #6485).
# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh
External Logging
You can now dump your SkyPilot cluster/job logs to external logging services like AWS CloudWatch and GCP Cloud Logging (#6331, #6405, #6411, #6369). Configure it in your ~/.sky/config.yaml:
logs:
  store: aws # Or 'gcp', etc.
  aws:
    ... # Service-specific options; see below.
What's New
UX/API
- 2x Faster sky status: Intelligent caching reduces status command latency by 50% (#6166)
- Optimized Imports: Lazy loading of dependencies for faster CLI startup (#6302)
- Autostop/Autodown: Improve autostop logic during cluster refresh (#6388)
- Fast network timeout detection: Faster timeout detection for remote API server checks (#6263)
- Better hints for parallel launch (#6549)
- Support sky api logout to log out from the API server (#6284, #6327, #6412)
- Type Checking: Enhanced mypy support with py.typed marker (#6440)
UI
- Interactive Tour: Guided tour for new users to learn dashboard features (#6565, #6301)
- Mobile Support: Responsive design for narrow screens (#6280)
- YAML Syntax Highlighting: Enhanced code readability with syntax highlighting (#6404)
- User Management: Delete users directly from the dashboard interface (#6434)
- Display fixes: Cluster history tracking issues (#6512), VSCode remote connection (#6555), workspace resources aggregation (#6537, #6283)
Storage/Volume
- S3 Mounting on ARM-based Instances: Full support for S3 mounting on ARM64 instances (#6255)
- Volume Management SDK: Added SDK for creating, managing persistent volumes across clouds (#6439, #6170, #6383)
- New examples for volumes: juiceFS, NFS, etc. (#6521)
API Server
API Server Deployment
- Helm Chart on Artifact Hub: SkyPilot Helm chart now available on Artifact Hub (#6560, #6371)
- More robust DB: Use Alembic for database schema migrations (#6579, #6556, #6196), thread safety improvements (#6422, #6279)
- OAuth2 Proxy Integration: Built-in OAuth2 proxy support (#6476)
- Redis Secret Support: Configuration for Redis URL via secrets (#6559)
- Deployment image size reduction (#6312, #6285)
- GPU Detection: Fixed GPU name matching with underscores (#6491)
Client Enhancements
- Enhanced Client Reliability: Automatic retry on transient failures (#6259, #6298, #6234)
- Robustness enhancements: lock timeout notifications (#6530), transient error retry (#6234), catalog refresh (#6272)
- Service Account Token Auth Fix: Fixed token authentication in file uploads (#6432)
- Issues fixes: file path escaping (#6262),
Managed Jobs
- Force Disable Cloud Buckets: Configuration option to disable cloud bucket usage for managed jobs and SkyServe (#6402, #6407)
- Multi-Task Log Viewing: See logs from all tasks in a managed job pipeline (#6415)
Cloud Integrations
- GCP N4 Instance Types: Support for new N4 machine family (#6253)
- GCP TPU Improvements: Enhanced TPU provisioning logic (#6420)
- Hyperbolic: Marked CUSTOM_MULTI_NETWORK as unsupported (#6245)
- RunPod: CPU instance launching support (#6450)
- Vast: Port opening support (#6282)
- Azure: Dependency resolution fixes (#6231)
Miscs
- Developer Experience: Improved format.sh for code consistency (#6580)
- Backward Compatibility: Fixed issues with Task field and version detection (#6548, #6564, #6485)
- Test Stability: Fixed various flaky tests (#6507, #6495, #6511, #6469)
- Release Pipeline: Fixed API version extraction and workflow issues (#6464, #6254)
- Buildkite: Sequential test execution to reduce resource usage (#6217)
Production Deployment Guides
- PostgreSQL with GCP Cloud SQL: Complete guide for deploying API server with Cloud SQL backend (#6325)
- Docker Deployment: Containerized API server deployment guide (#6296)
- AWS SSM as SSH proxy: Setup guide for AWS Systems Manager (#6522)
Contributors
We thank all contributors who made this release possible!
New Contributors: @LokmaSpeedy, @jacobergzhou, @amd-pratmish, @tedspare, @jimbz, @makhalin
All Contributors: @alex000kim, @amd-pratmish, @andylizf, @aylei, @bikramnehra, @cblmemo, @cg505, @clayrosenthal, @concretevitamin, @DanielZhangQD, @jacobergzhou, @jimbz, @kevinmingtarja, @kyuds, @lloyd-brown, @LokmaSpeedy, @lucamanolache, @makhalin, @Maknee, @Michaelvll, @rohansonecha, @romilbhardwaj, @SalikovAlex, @SeungjinYang, @tedspare, @zhenjiasun, @zpoint
Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!
Full Changelog
For a complete list of changes, see the commit history.
SkyPilot v0.10.0
SkyPilot v0.10.0: Enterprise-readiness with SSO, dashboard, external PostgreSQL, workspaces and more
We are excited to announce SkyPilot 0.10! This release is the largest release by far, bringing enterprise-ready features including API server deployment in production with SSO, feature-rich dashboard, external PostgreSQL, workspace isolation and graceful upgrade, automatic network setup, and SSH Node Pools.
Get it now:
pip install -U skypilot
- Highlights
- What's new
- ⚠️ Deprecations and removals
- Migration guide
- Get started today
Highlights
API server deployment in production
Single Sign-On (SSO) and service account (Okta, Google Workspace, etc.)
SkyPilot now integrates with enterprise SSO providers such as Okta and Google Workspace, enabling secure authentication with automatic account creation and access control.
Log in to the API server with SSO enabled:
$ sky api login -e https://skypilot.example.com
A web browser has been opened to http://skypilot.example.com/token. Please continue the login in the web browser.
To manually copy the token, press ctrl+c.
Logged into SkyPilot API server at: http://skypilot.example.com
└── Dashboard: http://skypilot.example.com/dashboard
Users authenticate via their organization's SSO provider, and their identities are automatically tracked across all SkyPilot resources.
Feature-rich Dashboard
The SkyPilot dashboard now includes a significant number of new features:
- See all infrastructure available in one page
- Edit your SkyPilot config in dashboard
- See and manage all your users in an organization
- Find your GPU metrics in dashboard
- Check your YAML/entrypoint/git commit hash for jobs
- Find more features in Dashboard section below.
External PostgreSQL for API server in production
SkyPilot 0.10 adds support for persisting API server state to an external PostgreSQL database, enabling high availability and disaster recovery for production deployments.
Configure your deployment to use a managed database service (e.g., AWS RDS, Cloud SQL) to ensure your cluster and job state survive API server restarts or migrations.
db: postgresql://myusername:mypassword@hostname:5432/database
Workspaces: isolation and declarative configuration for teams
Workspaces provides a declarative way to define isolated environments with custom cloud configurations for different teams or projects.
Configure workspaces to control which teams can access which infrastructure:
# API server config
workspaces:
  research-private:
    private: true
    allowed_users:
      - [email protected]
      - [email protected]
    gcp:
      project_id: skypilot-research-private
    aws:
      disabled: true
  ml-team:
    gcp:
      project_id: skypilot-ml-team-prod
Teams simply set their active workspace to use their workspace configuration:
# In team's .sky.yaml
active_workspace: ml-team
Graceful upgrade of API server
SkyPilot 0.10 introduces robust graceful upgrade of API server:
- Clients automatically wait for an upgrade to finish and retry
- Future compatibility across minor/major versions
Automatic high performance network setup (network_tier: best)
SkyPilot v0.10.0 can now automatically configure high-performance network with a single network_tier: best config. Supported infra:
- Nebius VMs
- Nebius managed Kubernetes service
- GCP VMs
- Google Kubernetes Engine (GPUDirect-TCPX, GPUDirect-TCPXO, GPUDirect-RDMA for H100 and H200)
SSH Node Pools: bring your own machines
Turn your existing machines — on-premises servers, cloud reserved instances or even your personal workstation — into SSH Node Pools to run SkyPilot clusters and jobs on them.
Configure your machines in ~/.sky/ssh_node_pools.yaml:
# ~/.sky/ssh_node_pools.yaml
my-node-pool:
  hosts:
    - 1.2.3.4
    - 1.2.3.5
Deploy SkyPilot on them with a single command:
$ sky ssh up
$ sky launch --infra ssh/my-node-pool -- python train.py
Your machines now appear as infra choices alongside cloud providers, complete with GPU availability tracking and resource management.
New clouds
- Hyperbolic cloud integration for cost-effective AI workloads (#5517)
- Samsung Cloud Platform (SCP) support for new provisioner interface in SkyPilot (#5587)
What's new
CLI & Core interfaces
- New --infra option to specify infrastructure instead of separate --cloud/--region/--zone flags (#5602, #5656, #5695, #5703)
  - Supports cloud providers: --infra aws/us-west-2/us-west-2a, --infra aws/*/us-west-2b
  - Kubernetes contexts: --infra k8s/my-k8s-context
  - SSH node pools: --infra ssh/my-ssh-pool
- Secrets: define secrets in your SkyPilot YAML for secure environment variable injection (#6015, #6025)
- GPU selection by memory size: --gpus 80GB+ (#5948)
- Support units for disk, memory, and autostop specification (#5952, #6026)
- Improved error handling and stacktrace display with SKYPILOT_DEBUG=1 (#6121)
- sky cancel now supports glob patterns for cluster names (#5933)
- Centralized log collection for tasks (#5992)
- CLI fixes and improvements (#5915, #5811, #6213, #5798, #5871, #5729, #5880, #5590, #6228, #6030)
- Core stability improvements (#6019, #5985, #5698, #5699, #5754, #5768, #5776, #5787, #5786, #5833, #5838, #5863, #5882, #6088, #6113)
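To illustrate the memory-based GPU selection listed above (--gpus 80GB+), here is a rough sketch of how such a spec could be parsed and matched. It is not SkyPilot's parser: the accepted units, the treatment of a spec without a trailing + as an exact match, and the names `MemoryFilter`, `parse_memory_filter`, and `matches` are all assumptions made for this example.

```python
import re
from typing import NamedTuple


class MemoryFilter(NamedTuple):
    min_gb: float   # requested memory, normalized to GB
    at_least: bool  # True when the spec ends with '+'


# Assumed unit table; binary multiples are used for simplicity.
_UNITS_IN_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(MB|GB|TB)(\+?)$", re.IGNORECASE)


def parse_memory_filter(spec: str) -> MemoryFilter:
    """Parse specs such as '80GB+' (at least 80 GB) or '24GB' (exactly 24 GB)."""
    match = _PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"Not a memory spec: {spec!r}")
    value, unit, plus = match.groups()
    return MemoryFilter(float(value) * _UNITS_IN_GB[unit.upper()], plus == "+")


def matches(memory_filter: MemoryFilter, gpu_memory_gb: float) -> bool:
    """Check whether a GPU with the given memory satisfies the filter."""
    if memory_filter.at_least:
        return gpu_memory_gb >= memory_filter.min_gb
    return gpu_memory_gb == memory_filter.min_gb
```

For example, filtering an illustrative catalog such as `{"A10G": 24, "A100-80GB": 80, "H200": 141}` with `parse_memory_filter("80GB+")` would keep the last two entries.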
Authentication & Security
- Single Sign-On (SSO) integration with OAuth2 providers (#5640, #5641, #5651, #5684, #5717, #5721, #5758, #5759, #5781, #5817, #6164)
- Service account support for programmatic access (#6077)
- Role-Based Access Control (RBAC) (#5872, #5938)
- RESTful Admin policies for policy enforcement (#5927, #6089, #5940)
- Enhanced authentication security (#5816, #6119, #6037, #6141, #6152, #6120, #5877)
Dashboard
- User management and filtering (#5936, #6078, #5708, #5722, #5997, #6183, #5935)
- Cluster and jobs history tracking (#6041, #5944)
- New Infra page (#5623, #5788)
- Configuration editing (#5748, #5770)
- GPU and API server metrics (#5907, #6116, #6092, #6139)
- YAML/entrypoint/git commit hash display for clusters and jobs (#5813, #5906, #5900)
- Real-time log streaming for managed jobs (#5808)
- Loading speed improvements (#5777, #5825)
- API server version display (#5784, #5827)
- Enhanced search and filtering capabilities (#5997, #6183)
- Display improvements (#5904, #5822, #6193)
API Server
v0.9.3
What's Changed
- [Core] Do not initialize conda for users if using docker image by @SeungjinYang in #5303
- [AWS] Use consistent list for instance termination logic by @SeungjinYang in #5316
- Fix: validate dag should not block uvicorn by @aylei in #5328
- [UX] Rename file to match import path by @DanielZhangQD in #5349
- update installation docs and Dockerfile to reflect omegaconf by @cg505 in #5356
- add typechecking for boto clients/resources by @cg505 in #5319
- [GCP] add helptext about API srv missing gcloud when installed using wget by @SeungjinYang in #5335
- support all_users in wait for job status in back compact tests by @zpoint in #5346
- fixes from #5335 by @SeungjinYang in #5361
- [k8s] Hints for querying stale kube current context by @kyuds in #5273
- [Docs] Remove meetup announcement banner by @romilbhardwaj in #5364
- API version schema check for release pipeline by @zpoint in #5309
- [GCP] Support hyperdisk-balanced for a3 series by @JiangJiaWei1103 in #5351
- [k8s] Better error message for stale jobs controller by @kyuds in #5274
- [GH] Fix release action by @Michaelvll in #5373
- [RunPod] Use zone to provision in a specific data center ID by @Kovbo in #5166
- [UX] remove references to SKYPILOT_GLOBAL_CONFIG envvar in docs by @SeungjinYang in #5374
- [aws] fix logic to detect rule with all traffic from security group by @SeungjinYang in #5332
- [UX] Remove credentials from the dashboard URL and update the dashboard build hint by @DanielZhangQD in #5363
- supress warning by setting default value of asyncio_default_fixture_loop_scope by @zpoint in #5348
- [Docs] Fix migration guide version by @romilbhardwaj in #5369
- update AWS credential setup docs, consolidate cloud auth docs by @cg505 in #5122
- [k8s] Force terminate misbehaving pods by @romilbhardwaj in #5370
- Fix broken test test_gcp_disk_tier by @zpoint in #5393
- Remove pytest.ini to remove test warning by @zpoint in #5379
- Refine API server deployment doc by @aylei in #5295
- [Docs] Add API server tuning guide by @aylei in #5176
- Introduce High Availability Service Controller by @andylizf in #4564
- [API server] Fix worker number for non-local low resource env by @aylei in #5409
- [k8s] Fix IPv6 SSH by @kyuds in #5413
- Fix terminating k8s cluster by @aylei in #5412
- [UX] api info: display dashboard on last line by @concretevitamin in #5417
- [UX] Minor fix. by @concretevitamin in #5420
- [config] remove omegaconf as dependency by @SeungjinYang in #5375
- [k8s] idea: allow an accelerator to map to multiple label values by @SeungjinYang in #5343
- [Nebius] Don't cache session across multiple requests by @SalikovAlex in #5347
- [Nebius] Add Docker support for Nebius cloud by @SalikovAlex in #5334
- Qwen3 235b example by @Michaelvll in #5425
- [UX][k8s] show-gpus for all allowed contexts by @kyuds in #5362
- [API server] make server config conherent by @aylei in #5414
- [Catalog] use v7 for latest runpod by @Michaelvll in #5422
- [UX] Update dashboard favicon with transparent background by @DanielZhangQD in #5426
- [Core][RunPod] Show error for RunPod multi-node by @kyuds in #5368
- Support SDK backward compatibility test by @zpoint in #5398
- Add helm support for RunPod credentials by @funkypenguin in #5214
- [aws] script to get default security group name for aws by @SeungjinYang in #5427
- Update pypi description by @Michaelvll in #5444
- avoid using removed LEGACY_SINGLETON_REGION constant by @cg505 in #5441
- [Docs] Fix DWS/Kueue title and URL by @Michaelvll in #5443
- release pipeline trigger filter based on name by @zpoint in #5367
- [Doc] Add runpod credentials setup for API server by @aylei in #5433
- Fix flaky of test_multi_echo -- change sshd config to support large number of jobs by @zpoint in #5323
- Support launch controller and jobs on different cloud for smoke test by @zpoint in #5435
- (Helm chart) Add configurable ingress host by @turtlebasket in #5452
- [Example] AWS EFA Example by @KeplerC in #5318
- [runpod] preserve docker configured environment variables by @SeungjinYang in #5451
- [Docs] Clarify Nebius credential setup by @Michaelvll in #5298
- [k8s] gpu bin packing via affinity by @SeungjinYang in #5423
- [docs] leave in instructions to deal with omegaconf until next stable release by @SeungjinYang in #5460
- [k8s] CPU only jobs to prefer nodes without GPUs by @SeungjinYang in #5357
- Fix failure of test_kubernetes_context_failover by @zpoint in #5455
- Fix flaky of test_cancel_launch_and_exec_async by @zpoint in #5456
- [GCP] Remap series-specific disk types by @JiangJiaWei1103 in #5457
- add task envs to event_callback by @ggilley in #5474
- [Nebius] Conditionally mount AWS credential files for Nebius profile by @SalikovAlex in #5464
- Remove upper limit on urllib3 version by @vnavkal in #5469
- fix controller cluster name breaking by @cg505 in #5482
- [UX][k8s] backwards compatibility for k8s show-gpus by @kyuds in #5488
- [UX] Make sky check parallel by @kyuds in #5483
- reload AWS_SESSION_TOKEN and KUBECONFIG on local API server by @cg505 in #5478
- [jobs/serve] validate controller name before updating value by @cg505 in #5486
- [Nebius] Add support config file and remove hardcode by @SalikovAlex in #5463
- [k8s] do not consider nodes with exact cpu/mem requirements by @SeungjinYang in #5481
- [docs] snippet on multi node jobs in k8s by @SeungjinYang in #5495
- [k8s] fix helm chart deployment of API server by @SeungjinYang in #5507
- Release pipeline refactor - automated release by @zpoint in #5470
- Use more specific header name by @colinjc in #5515
- remove API version bump, add bw compatibility code by @SeungjinYang in #5522
- chore: minor fix to api server documentation by @SeungjinYang in #5512
- Reload AWS default profile for local API server by @aylei in #5511
- [Examples] Llama 3.1 lora finetuning torch version pin by @romilbhardwaj in #5531
- [Docker] Add private docker registry by @Michaelvll in #5526
- Add optional version parameter to docker build pipeline to prevent version mismatch by @zpoint in #5525
- [GCP] Correctly delete cpu mig instance by @Michaelvll in #5524
- [Do...
SkyPilot v0.9.2
This patch release is a minor bump from v0.9.1 to resolve an issue that could affect users with old clusters from v0.7 and earlier (#5439).
See the full v0.9 release notes for everything new in SkyPilot v0.9!
SkyPilot v0.9.1
SkyPilot v0.9.1: API Server Architecture, Web Dashboard, Faster Storage, Improved Configuration and more!
We're excited to announce the release of SkyPilot v0.9.1! This update brings major improvements to SkyPilot, making it faster, more powerful and flexible for production-ready deployment.
Highlights
Client-Server Architecture
The new client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.
- Unified view and management: Get a single view of all running clusters and jobs across the organization and all infra you have.
- Integrate with workflow orchestrators: SkyPilot state is centralized on the API server and does not need to be maintained in orchestrators like Airflow.
- Multi-tenancy: Share clusters, jobs, and services securely among teammates.
Web dashboard
SkyPilot has a new dashboard! Easily view and manage your clusters, jobs and logs.
Access it with sky dashboard.
New configuration system
SkyPilot now supports specifying configuration at various levels: CLI, SkyPilot YAML, project-level config, client-level global config and server-side config.
You can now have a project configuration storing default values for all jobs in a project, a user configuration to apply globally to all projects and Task YAML overrides for specific jobs.
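A layered configuration like this is commonly resolved by deep-merging the layers from least to most specific. The sketch below shows that general technique only; the precedence order used in the usage note (user, then project, then task) is an assumption drawn from the paragraph above, and `deep_merge` and `resolve_config` are hypothetical helpers rather than SkyPilot functions.

```python
from typing import Any, Dict

Config = Dict[str, Any]


def deep_merge(base: Config, override: Config) -> Config:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(*layers: Config) -> Config:
    """Merge layers in order; later (more specific) layers take precedence."""
    result: Config = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result
```

For instance, with a user config of `{"jobs": {"controller": {"autostop": {"idle_minutes": 10, "down": False}}}}` and a project config that sets only `idle_minutes: 5` under the same keys, `resolve_config(user, project)` yields `idle_minutes: 5` while keeping `down: False` from the user layer.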
New mount_cached storage - 9.6x faster checkpointing
New storage mode mount_cached uses the local disk as a cache for cloud storage buckets. It boosts GPU utilization by making cloud I/O asynchronous.
file_mounts:
  /checkpoints:
    source: gs://my-checkpoints-bucket
    mode: MOUNT_CACHED  # Will asynchronously upload all writes to the bucket
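The idea behind a cached mount (write to local disk, upload in the background) can be sketched as a small write-behind cache. This is only an illustration of the pattern, not how mount_cached is implemented: the class name is hypothetical, and a second local directory stands in for the cloud bucket.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class WriteBehindCache:
    """Writes land on local disk first; a background thread copies them out."""

    def __init__(self, cache_dir: Path, bucket_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.bucket_dir = bucket_dir  # stand-in for a cloud bucket (assumption)
        self.errors: list[tuple[str, OSError]] = []
        self._pending: queue.Queue = queue.Queue()
        threading.Thread(target=self._upload_loop, daemon=True).start()

    def write(self, name: str, data: bytes) -> None:
        # Fast path: the caller only waits for the local write.
        (self.cache_dir / name).write_bytes(data)
        self._pending.put(name)

    def _upload_loop(self) -> None:
        while True:
            name = self._pending.get()
            try:
                shutil.copy2(self.cache_dir / name, self.bucket_dir / name)
            except OSError as exc:
                self.errors.append((name, exc))
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        # Block until every queued upload has been attempted.
        self._pending.join()


cache_root = Path(tempfile.mkdtemp())
bucket_root = Path(tempfile.mkdtemp())
cache = WriteBehindCache(cache_root, bucket_root)
cache.write("step_100.ckpt", b"weights")  # returns without waiting for the upload
cache.flush()
```

The training loop only pays for the local write, which is why this pattern keeps accelerators busy during checkpointing.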
New cloud: Nebius
SkyPilot now supports Nebius cloud! Getting started is easy:
$ sky check nebius
$ sky launch --gpus H200:8 --cloud nebius
ARM instance support - run SkyPilot on GH200s, GB200s, and more!
New native images for ARM instances allow you to run SkyPilot on your GH200s and GB200s on Lambda cloud, GCP or your own Kubernetes clusters! (#4835)
What's new
CLI & Core interfaces
- sky CLI now returns non-zero exit code on launch/exec/logs/jobs launch/jobs logs failures (#4846)
  - This improves scriptability with sky CLI in automated workflows.
- sky check now separately checks storage and compute capabilities (#4996, #4977)
- New --all option for sky jobs queue to show all jobs (#4923)
- resources.gpus can now be used to alias resources.accelerators in the SkyPilot YAML (#5207)
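Because failures now surface as non-zero exit codes, an automation script can branch on the result of a CLI call. The sketch below assumes a hypothetical helper named run_step; it uses the Python interpreter as a stand-in command so it runs without SkyPilot installed.

```python
import subprocess
import sys


def run_step(cmd: list[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(cmd, check=False)
    if completed.returncode != 0:
        print(
            f"step failed with exit code {completed.returncode}: {' '.join(cmd)}",
            file=sys.stderr,
        )
        return False
    return True


# In a real pipeline the command could be something like
# ["sky", "launch", "-y", "task.yaml"] (task.yaml is a hypothetical file).
# Stand-in commands are used here so the example is self-contained.
ok = run_step([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
```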
Managed Jobs
- Multiple users can now share the same jobs controller (#4733)
- Autostop and autodown settings for the jobs controller can now be customized (#5182)
  # ~/.sky/config.yaml
  jobs:
    controller:
      # autostop: false  # to disable completely
      autostop:
        idle_minutes: 5
        down: true
- See other users' jobs with sky jobs queue -u when using a shared controller (#4787)
- Access to cloud object storage is no longer necessary for using file mounts or workdir in managed jobs. (#4708)
  - Running managed jobs on Kubernetes no longer requires cloud access.
Storage
- New mode: mount_cached (#4369)
  - This mode is optimized for checkpointing large models
  - It asynchronously uploads the cached directory to the cloud storage bucket, increasing GPU utilization.
- Fix issue with openrsync on macOS 15 causing upload failures (#5196)
- .gitignore handling is now more robust (#4988)
- Fix exclusion for AWS bucket upload (#5128)
Kubernetes
- Revamped /dev/fuse access mechanism on k8s (#5028)
  - We no longer need to request smarter-devices-fuse resource, making SkyPilot fuse mounting compatible on autoscaling clusters.
- B200 GPUs are now supported on GKE (#5102)
- Scale-to-zero autoscaling is now supported on GKE (#4935)
  - SkyPilot can now inspect the node pools available on scale-to-zero clusters before provisioning.
  - This allows SkyPilot to intelligently filter out clusters that cannot provision the requested GPU type.
- sky check now detects and hints for unlabeled GPU nodes on GKE (#5065)
- GPU names are now case-insensitive; numbers-only name formats are now supported (#4756, #4925)
- Fixed fractional CPU support when using <1 CPU core (#4707)
- Fix node filtering when provisioning multiple GPUs (#4930)
- initContainers can now be overridden through pod_config (#5247)
- Instructions on mounting NFS volumes (#4951)
- GPU labelling script can now use custom context names (#5072)
- Fixed a bug where clusters from stale contexts could not be cleaned up (#4980)
Backend
- New Client-Server Architecture (#4660)
  - This allows SkyPilot to be deployed as a remote service shared by multiple users.
- Fixed conda support when using python 3.12 (#4035)
- sky exec now waits for the cluster to be started (#4867)
- sky local up --ips now supports specifying sudo password (#5030)
- Clouds with expired credentials are now automatically excluded from failover (#5015)
SkyServe
- New Spot/On-demand Policy: dynamic_fallback (#4628)
  - New spot_placer field can be set to dynamic_fallback to let SkyPilot automatically switch from spot to on-demand instances if spot instances are not available.
  - More details in paper
- Fixed: any_of field order issue causing version bump to not work (#4978)
- Fixed: LiveError on controller (#4995)
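The fallback behavior described above (use spot when available, otherwise switch to on-demand) can be illustrated with a deliberately simplified planner. This is not the policy from the paper or SkyServe's implementation: the function name and the top-up rule are assumptions made for illustration only.

```python
def plan_replicas(target: int, available_spot: int) -> dict[str, int]:
    """Fill the target with spot replicas first, then top up with on-demand.

    Simplified illustration: real policies also weigh zones, preemption
    history, and cost, none of which is modeled here.
    """
    if target < 0 or available_spot < 0:
        raise ValueError("replica counts must be non-negative")
    spot = min(available_spot, target)
    return {"spot": spot, "on_demand": target - spot}


enough_spot = plan_replicas(target=4, available_spot=6)
partial_spot = plan_replicas(target=4, available_spot=1)
no_spot = plan_replicas(target=4, available_spot=0)
```

When spot capacity returns, rerunning the planner shifts the split back toward spot, which is the "dynamic" part of the idea.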
Cloud Support
- New cloud: Nebius (#4573, #4838)
- GCP
- RunPod
  - Custom docker images with non-root user are now supported (#4683)
- Lambda
- Fluidstack: NVLINK GPUs are now supported (#3954)
- IBM: new fetcher for IBM catalog (#5003)
- Cloudflare R2: fixed upload issues when using new awscli versions (#5282)
New Examples and Tutorials
- Large-Scale batch inference
- Vector DB ingest and querying
- RAG (Retrieval Augmented Generation)
- Deepseek r1-671B with SGLang
- Gemma3 Example
- Llama 4 example
- Hyperpod+EKS example
- High-Performance Model Checkpointing with mount_cached
⚠️ Deprecations and removals
Removed
- Env vars starting with SKY_ are no longer supported. Use SKYPILOT_ env vars instead.
- Old services from 0.7.0 (before #4439) may need to be stopped and restarted.
- kubernetes is no longer a valid region name. Use the k8s context name to specify a Kubernetes cluster if required.
Deprecated
- experimental.config_overrides has been deprecated. Use the config field instead.
Migration guide
SkyPilot 0.9.1 introduces the asynchronous execution model, which may cause compatibility issues with user programs using SkyPilot SDKs <=0.8.1.
Refer to the migration guide to upgrade your code.
TL;DR: Wrap all SkyPilot SDK function calls (except tail_logs) with sky.stream_and_get() to make your program behave mostly the same as before:
# <= 0.8.1
job_id, handle = sky.launch(task)
# 0.9.1
job_id, handle = sky.stream_and_get(sky.launch(task))
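For codebases with many SDK calls, the wrapping step can be centralized in one helper. The sketch below is an assumption-laden illustration: call_sync is a hypothetical name, and a stand-in object replaces the sky module so the example runs without SkyPilot installed. Only the shape taken from the snippet above (an SDK call whose result is passed to stream_and_get) is relied on.

```python
from types import SimpleNamespace


def call_sync(sdk, fn_name: str, *args, **kwargs):
    """Call an SDK function and block until its result is available.

    Not suitable for tail_logs, which the migration note excludes.
    """
    request = getattr(sdk, fn_name)(*args, **kwargs)
    return sdk.stream_and_get(request)


# Stand-in for the real module, for demonstration only (assumption).
_results = {"req-1": (1, "handle")}
fake_sdk = SimpleNamespace(
    launch=lambda task: "req-1",
    stream_and_get=lambda request: _results[request],
)

job_id, handle = call_sync(fake_sdk, "launch", "my-task")
```

With the real SDK, the equivalent call would be call_sync(sky, "launch", task) after importing sky.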
Thanks to all contributors!
New contributors: @kyuds, @BorenTsai, @funkypenguin, @JiangJiaWei1103, @SalikovAlex, @flaviomartins, @ajay, @bradhilton, @SeungjinYang, @eltociear, @vvidovic, @KennBro, @DanielZhangQD
Many thanks to all contributors who contributed to this release!
Contributors: @aylei, @zpoint, @SeungjinYang, @cg505, @michaelvl...
SkyPilot v0.8.1
This patch release is a minor bump over v0.8.0 to get you the latest fixes as soon as possible.
- Pin wheel<0.46.0 to mitigate build errors when launching clusters in environments with wheel>=0.46.0 (#5153)
Stay tuned for a major upgrade coming up in v0.9.0!





