@hhoikoo (Member) commented Oct 31, 2025

resolves #6432 (BA-2851)

This change adds configuration for partitioning hardware resources among multiple agents, instead of every agent always seeing the full resource pool. Partitioning prevents unintended over-allocation that could crash kernels. Three allocation modes are available:

  • SHARED: allows all agents to see full resources (useful for stress testing). This is the same behavior as before.
  • AUTO_SPLIT: automatically divides resources equally among agents.
  • MANUAL: lets users specify exact per-agent allocations for all resources.

Single-agent deployments remain unaffected and retain access to all available hardware resources.
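
To make the three modes concrete, here is a minimal sketch of how one scalar resource amount (for example, CPU cores) might be resolved per agent. It is not the code in this PR: the function `partition_amount`, its parameters, and the enum string values are hypothetical, and returning the full pool for a single agent in every mode is an assumption based on the sentence above.

```python
from decimal import Decimal
from enum import Enum
from typing import Optional


class ResourceAllocationMode(Enum):
    # The string values are illustrative assumptions, not the PR's actual values.
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


def partition_amount(
    mode: ResourceAllocationMode,
    total: Decimal,
    num_agents: int,
    manual: Optional[Decimal] = None,
) -> Decimal:
    """Return how much of one resource a single agent may see (hypothetical helper)."""
    # Assumption: a single agent always sees the whole pool, whatever the mode.
    if num_agents <= 1 or mode is ResourceAllocationMode.SHARED:
        return total
    if mode is ResourceAllocationMode.AUTO_SPLIT:
        # Equal division among all agents on the host.
        return total / num_agents
    # MANUAL: the operator must have configured an explicit amount.
    if manual is None:
        raise ValueError("MANUAL mode requires an explicit per-agent allocation")
    return manual
```

With two agents and 16 cores, this sketch yields 16 per agent under SHARED, 8 under AUTO_SPLIT, and whatever the operator configured under MANUAL.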

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels on Oct 31, 2025
@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from e6c1f4b to d84258e on October 31, 2025 01:30
@hhoikoo requested a review from Copilot on October 31, 2025 01:30

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).

Key changes:

  • Introduces ResourcePartitioner class to manage resource allocation across agents
  • Adds ResourceAllocationMode enum with SHARED, AUTO_SPLIT, and MANUAL modes
  • Implements validation logic to ensure consistent manual allocations across agents
  • Updates agent initialization to use resource partitioning
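
As an illustration of what validating manual allocations across agents could involve, the sketch below checks two plausible conditions: every agent declares every resource, and the per-resource sum does not exceed the host capacity. The function name, the data shapes, and both rules are assumptions for illustration; the PR's actual validation lives in `src/ai/backend/agent/config/unified.py` and may differ.

```python
from decimal import Decimal
from typing import Mapping


def validate_manual_allocations(
    total: Mapping[str, Decimal],
    per_agent: Mapping[str, Mapping[str, Decimal]],
) -> list[str]:
    """Collect human-readable problems found in MANUAL allocations (hypothetical helper)."""
    errors: list[str] = []
    expected = set(total)
    # Assumed rule 1: each agent must declare an amount for every known resource.
    for agent_id, alloc in per_agent.items():
        missing = expected - set(alloc)
        if missing:
            errors.append(f"{agent_id}: missing allocation for {sorted(missing)}")
    # Assumed rule 2: the sum over all agents must fit within the host capacity.
    for name, capacity in total.items():
        requested = sum(
            (alloc.get(name, Decimal(0)) for alloc in per_agent.values()),
            Decimal(0),
        )
        if requested > capacity:
            errors.append(f"{name}: requested {requested} exceeds capacity {capacity}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a config loader report every inconsistency in one pass.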

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |

@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from d84258e to c5114a9 on October 31, 2025 03:56
@hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9f12687 to fdee4b0 on November 3, 2025 01:05
@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 310d847 to 3faac0f, on November 4, 2025 06:02
@hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from fdee4b0 to 90f0702 on November 4, 2025 06:10
@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 36824ac to 279e71b, on November 4, 2025 06:30
@hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 90f0702 to e2b1902 on November 4, 2025 06:35
@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 280831f to db07080, on November 4, 2025 10:32
github-actions bot added the comp:manager (Related to Manager component) label on Nov 4, 2025
@hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from db07080 to 3aac7df on November 5, 2025 02:10