@zpoint zpoint commented Oct 30, 2025

We have recently seen some Buildkite hosts run out of memory (OOM), possibly because our tests use a lot of memory.

Before this PR

We can see 16 workers [1 item] in the log

(sky) zpoint:~/codes/skypilot (dev/zeping/disable_parallel_for_smoke_test)$ pytest tests/smoke_tests/test_examples.py::test_nemorl --aws -k accelerator0 --remote-server -v
D 10-30 15:32:59 skypilot_config.py:247] using default user config file: ~/.sky/config.yaml
D 10-30 15:32:59 skypilot_config.py:514] Config loaded from /home/zpoint/.sky/config.yaml:
D 10-30 15:32:59 skypilot_config.py:514] {}
D 10-30 15:32:59 skypilot_config.py:514] 
D 10-30 15:32:59 skypilot_config.py:521] Config syntax check passed for path: /home/zpoint/.sky/config.yaml
D 10-30 15:32:59 skypilot_config.py:274] using default project config file: .sky.yaml
D 10-30 15:32:59 skypilot_config.py:666] client config (before task and CLI overrides): 
D 10-30 15:32:59 skypilot_config.py:666] {}
D 10-30 15:32:59 skypilot_config.py:666] 
=============================================================================== test session starts ================================================================================
platform linux -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/zpoint/codes/skypilot
configfile: pyproject.toml
plugins: env-1.1.5, anyio-4.9.0, asyncio-1.0.0, xdist-3.7.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
16 workers [1 item]       
[nemorl] Test started. Log: less -r /tmp/nemorl-tmpc9lbq.log
[nemorl] Overriding API server endpoint: http://host.docker.internal:41540
[nemorl] Failed (returned 1).

After this PR

The 16 workers [1 item] line no longer appears in the log

(sky) zpoint:~/codes/skypilot (dev/zeping/disable_parallel_for_smoke_test)$ pytest tests/smoke_tests/test_examples.py::test_nemorl --aws -k accelerator0 --remote-server -v
D 10-30 15:33:24 skypilot_config.py:247] using default user config file: ~/.sky/config.yaml
D 10-30 15:33:24 skypilot_config.py:514] Config loaded from /home/zpoint/.sky/config.yaml:
D 10-30 15:33:24 skypilot_config.py:514] {}
D 10-30 15:33:24 skypilot_config.py:514] 
D 10-30 15:33:24 skypilot_config.py:521] Config syntax check passed for path: /home/zpoint/.sky/config.yaml
D 10-30 15:33:24 skypilot_config.py:274] using default project config file: .sky.yaml
D 10-30 15:33:24 skypilot_config.py:666] client config (before task and CLI overrides): 
D 10-30 15:33:24 skypilot_config.py:666] {}
D 10-30 15:33:24 skypilot_config.py:666] 
['tests/smoke_tests/test_examples.py::test_nemorl']
=============================================================================== test session starts ================================================================================
platform linux -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/zpoint/codes/skypilot
configfile: pyproject.toml
plugins: env-1.1.5, anyio-4.9.0, asyncio-1.0.0, xdist-3.7.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                   

tests/smoke_tests/test_examples.py [nemorl] Test started. Log: less -r /tmp/nemorl-zd7dk38q.log
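
The before/after logs suggest that parallel workers are no longer started when only one test is selected. The diff itself is not shown in this conversation, so the following is only a minimal sketch of that idea, not the PR's actual change. The helper name `choose_num_workers` and its parameters are hypothetical.

```python
import os
from typing import Optional, Sequence


def choose_num_workers(selected_tests: Sequence[str],
                       cpu_count: Optional[int] = None) -> int:
    """Return how many parallel workers to start; 0 means run in-process.

    Hypothetical helper for illustration only, not taken from the PR.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    # A single test cannot be split across workers, so starting extra
    # worker processes would only add memory overhead on the host.
    if len(selected_tests) <= 1:
        return 0
    # Never start more workers than there are tests or CPU cores.
    return min(len(selected_tests), cpu_count)


if __name__ == '__main__':
    tests = ['tests/smoke_tests/test_examples.py::test_nemorl']
    print(choose_num_workers(tests, cpu_count=16))
```

A result like this could, for example, be passed to pytest-xdist through its `-n` option. How the PR wires this into the test setup is an assumption here.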

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@zpoint zpoint requested review from Michaelvll, aylei and cg505 October 30, 2025 07:37
zpoint commented Oct 30, 2025

/smoke-test --remote-server

@aylei aylei left a comment

LGTM, thanks @zpoint !

zpoint commented Oct 30, 2025

/smoke-test --remote-server

zpoint commented Oct 30, 2025

/smoke-test --remote-server

aylei commented Oct 30, 2025

Looks like the memory increase now exceeds 20%, after we limited the remote server to 4 CPU cores and lowered the baseline usage:


KEY                BASELINE     ACTUAL   INCREASE       %
721:worker:short   161.8 MB   161.9 MB    +0.1 MB   +0.1%
833:server         165.8 MB   187.3 MB   +21.5 MB  +13.0%
746:server         165.6 MB   171.1 MB    +5.5 MB   +3.3%
745:server         165.5 MB   171.2 MB    +5.6 MB   +3.4%
683:server         166.9 MB   171.5 MB    +4.5 MB   +2.7%
727:worker:short   162.2 MB   162.2 MB    +0.0 MB   +0.0%
725:worker:short   222.6 MB   223.0 MB    +0.4 MB   +0.2%
1056:worker:long     0.0 MB   244.7 MB  +244.7 MB   +inf%
TOTAL             1210.5 MB  1492.9 MB  +282.4 MB  +23.3%
FINFO:conftest:Last worker finished, cleaning up container...
sky-remote-test-buildkite-generic-remote-server3
sky-remote-test-buildkite-generic-remote-server3

It should be fine to tune the percentage threshold. cc @kevinmingtarja
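
To make the arithmetic behind the TOTAL row concrete, here is a small sketch that recomputes the overall increase from the per-process figures in the table above and compares it with the 20% threshold mentioned in this comment. The function names and the data layout are hypothetical; the real check in the test suite may be implemented differently.

```python
from typing import Dict, Tuple

# (baseline MB, actual MB) per process, copied from the table above.
SAMPLES: Dict[str, Tuple[float, float]] = {
    '721:worker:short': (161.8, 161.9),
    '833:server': (165.8, 187.3),
    '746:server': (165.6, 171.1),
    '745:server': (165.5, 171.2),
    '683:server': (166.9, 171.5),
    '727:worker:short': (162.2, 162.2),
    '725:worker:short': (222.6, 223.0),
    '1056:worker:long': (0.0, 244.7),
}


def percent_increase(baseline_mb: float, actual_mb: float) -> float:
    """Relative growth in percent; infinite if the baseline is zero."""
    if baseline_mb == 0:
        return float('inf') if actual_mb > 0 else 0.0
    return (actual_mb - baseline_mb) / baseline_mb * 100


def total_percent_increase(samples: Dict[str, Tuple[float, float]]) -> float:
    """Growth of summed memory across all processes, in percent."""
    baseline = sum(b for b, _ in samples.values())
    actual = sum(a for _, a in samples.values())
    return percent_increase(baseline, actual)


if __name__ == '__main__':
    threshold = 20.0  # percentage threshold referred to in the comment
    total = total_percent_increase(SAMPLES)
    print(f'total increase: {total:.1f}%, exceeds threshold: {total > threshold}')
```

Note that the new long worker (baseline 0.0 MB) accounts for most of the total increase, which is why the total crosses the threshold even though each pre-existing process grew by 13% or less.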

zpoint commented Oct 30, 2025

/smoke-test --remote-server -k test_big_file_upload_memory_usage

@zpoint zpoint merged commit d365baf into skypilot-org:master Oct 30, 2025
21 checks passed