add calc_server #66
Conversation
@YYYYimo please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement: This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Pull Request Overview
This PR adds a calc_server.py module that orchestrates end-to-end training runs for the Calc-X example. The server manages the complete workflow from data loading through result collection.
Key changes:
- Implements a comprehensive orchestration server for the Calc-X example
- Adds flexible dataset loading with multiple fallback options (parquet → jsonl → demo data)
- Provides task queuing, worker process management, and rollout collection functionality
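The parquet → jsonl → demo fallback could be sketched as follows; the function name, paths, and demo record here are illustrative assumptions, not the PR's actual code:

```python
import json
from pathlib import Path

def load_samples(data_dir: Path):
    """Load samples with the fallback order: parquet -> jsonl -> demo."""
    parquet_files = sorted(data_dir.glob("*.parquet"))
    if parquet_files:
        import pandas as pd  # only required when parquet files exist
        return pd.read_parquet(parquet_files[0]).to_dict("records")
    jsonl_path = data_dir / "data.jsonl"
    if jsonl_path.exists():
        return [json.loads(line)
                for line in jsonl_path.read_text().splitlines() if line.strip()]
    # last resort: a tiny built-in demo set so the example still runs
    return [{"question": "1 + 1", "answer": "2"}]
```

Deferring the pandas import to the parquet branch keeps the jsonl and demo paths dependency-free.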
examples/calc_x/calc_server.py
Outdated
```python
print(f"[server] loaded {len(samples)} samples")

# start agent workers in separate process
proc = mp.Process(target=trainer_process_entry, args=(SERVER_URL, n_workers, None), daemon=False)
```
Copilot AI (Aug 20, 2025):
The trainer process is not properly cleaned up if the main process exits unexpectedly. Consider using a context manager or try/finally block to ensure process cleanup.
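A minimal sketch of the suggested try/finally pattern, assuming a process object with the standard multiprocessing interface (the helper name and usage are illustrative, not the PR's code):

```python
def run_with_cleanup(proc, work):
    """Start proc, run work(), and always clean the process up."""
    proc.start()
    try:
        return work()
    finally:
        # reached even if work() raises or the user hits Ctrl+C
        if proc.is_alive():
            proc.terminate()
            proc.join(timeout=5)

# intended usage (sketch):
# proc = mp.Process(target=trainer_process_entry, args=(SERVER_URL, n_workers, None))
# run_with_cleanup(proc, main_loop)
```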
examples/calc_x/calc_server.py
Outdated
```python
try:
    rdict = rollout.model_dump()
except Exception:
    # fallback: attempt raw attributes
    rdict = rollout.__dict__
```
Copilot AI (Aug 20, 2025):
The fallback to rollout.__dict__ on line 116 may not work correctly if rollout is a Pydantic model, as Pydantic models store data differently. Consider using rollout.dict() as an alternative fallback for older Pydantic versions.
Suggested change:
```python
# fallback: use dict() for Pydantic v1
rdict = rollout.dict()
```
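A version-agnostic helper could chain all three fallbacks explicitly; this is a sketch under the assumption that the rollout exposes the usual Pydantic v2 (`model_dump`) or v1 (`dict`) interface, and `rollout_to_dict` is a hypothetical name:

```python
def rollout_to_dict(rollout):
    """Serialize a rollout across Pydantic versions, with a raw fallback."""
    if hasattr(rollout, "model_dump"):   # Pydantic v2
        return rollout.model_dump()
    if hasattr(rollout, "dict"):         # Pydantic v1
        return rollout.dict()
    # last resort: raw attributes (may miss computed or private fields)
    return dict(vars(rollout))
```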
```python
print("[server] terminating trainer process")
proc.terminate()
proc.join(timeout=5)
```
Copilot AI (Aug 20, 2025):
After proc.join(timeout=5), there's no check if the process actually terminated. If the timeout expires, the process may still be running. Consider adding a force kill if the process doesn't terminate gracefully.
Suggested change:
```python
if proc.is_alive():
    print("[server] WARNING: trainer process did not terminate after join; force killing")
    proc.terminate()
    proc.join(timeout=1)
```
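Since repeating `terminate()` may not help a process that ignores SIGTERM, the shutdown could instead escalate to `Process.kill()` (SIGKILL, available in multiprocessing since Python 3.7); a hedged sketch, not the PR's code:

```python
def stop_process(proc, grace: float = 5.0) -> None:
    """Terminate gracefully; escalate to kill() if the join times out."""
    proc.terminate()           # SIGTERM: ask the process to exit
    proc.join(timeout=grace)
    if proc.is_alive():        # join timed out; process ignored SIGTERM
        proc.kill()            # SIGKILL: cannot be ignored
        proc.join()
```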
examples/calc_x/calc_server.py
Outdated
```python
OUTPUT_PATH = Path(__file__).parent / "rollouts.jsonl"


def trainer_process_entry(server_url: str, n_workers: int = 1, max_tasks: int | None = None):
```
You don't need to put server and client in the same script.
Thanks, I’ll update it.
Problem: After enabling … Key Error Messages: …
Root Cause: The core flaw in the …
Run vLLM with: …
You can ignore the failure when vLLM has stream=True; addressing that is a separate task being handled in parallel. Could you explain how you were going to run the example, and what is the expected output of this example?
Yes, to run the …
Expected Output: …
Calc server that orchestrates end-to-end runs for the Calc‑X example. It:
- starts an AgentLightning server,
- loads the dataset (preferring examples/calc_x/data/*.parquet, falling back to data.jsonl or a small demo set),
- spawns a Trainer worker process running CalcAgent,
- queues all tasks and polls for completed rollouts,
- writes rollouts to rollouts.jsonl, and performs cleanup.
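The task-queue-to-JSONL part of that flow could be condensed into a sketch like the following; `orchestrate` and its parameters are illustrative stand-ins for the AgentLightning-based code, not the actual module:

```python
import json
from pathlib import Path

def orchestrate(samples, run_task, output_path: Path) -> int:
    """Queue each sample, collect one rollout per task, write them as JSONL."""
    count = 0
    with open(output_path, "w") as f:
        for task_id, sample in enumerate(samples):
            # in the real server, run_task would be an agent rollout
            rollout = {"task_id": task_id, "result": run_task(sample)}
            f.write(json.dumps(rollout) + "\n")
            count += 1
    return count
```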