Skip to content

Conversation

@ncrispino
Copy link
Collaborator

PR Title Format

feat: implement inject-and-continue for agent coordination (preempt-not-restart)

Description

This PR implements the "inject-and-continue" approach for agent coordination, replacing the previous "restart from scratch" behavior when new answers arrive. When an agent provides a new_answer while other agents are working, those agents now receive an update injection and continue their work with preserved context, rather than being killed and restarted from zero.

Key improvements:

  • Agents preserve their full thinking history when receiving updates
  • No wasted computation regenerating ideas from scratch
  • Enables true collaborative building where agents can synthesize and improve each other's work
  • More specific workspace update messages showing exactly which workspaces were affected

Type of change

  • New feature (feat:) - Non-breaking change which adds functionality
  • Documentation (docs:) - Documentation updates
  • Bug fix (fix:) - Non-breaking change which fixes an issue
  • Breaking change (breaking:) - Fix or feature that would cause existing functionality to not work as expected
  • Code refactoring (refactor:) - Code changes that neither fix a bug nor add a feature
  • Tests (test:) - Adding missing tests or correcting existing tests
  • Chore (chore:) - Maintenance tasks, dependency updates, etc.
  • Performance improvement (perf:) - Code changes that improve performance
  • Code style (style:) - Changes that do not affect the meaning of the code
  • CI/CD (ci:) - Changes to CI/CD configuration files and scripts

Changes Made

Core Implementation (massgen/orchestrator.py)

  1. Removed premature restart check (lines 1797-1802):

    • Previously: Agents were restarted from scratch at the start of execution if restart_pending=True
    • Now: All agents go through injection logic in the iteration loop for consistent behavior
  2. Enhanced _inject_update_and_continue() method:

    • Properly tracks coordination state in both orchestrator and coordination tracker
    • Clears pending_agent_restarts flag after successful injection
    • Handles edge case where agent already has full context (no new answers to inject)
  3. Improved _build_update_message() method:

    • Now shows specific workspace paths affected by the update
    • Example: "agent1's work: /path/to/temp_workspaces/gemini_agent/"
    • Helps agents know exactly where to find new files

Coordination Tracking (massgen/coordination_tracker.py, massgen/utils.py)

  • Added UPDATE_INJECTED event type to track when agents receive mid-work updates
  • Properly integrated with existing restart tracking mechanisms

Documentation Updates

Design Documentation (docs/dev_notes/preempt_not_restart_design.md)

  • Added implementation status section with completion date
  • Documented race condition limitation (acceptable by design)
  • Explained safe-point injection and why agents won't be interrupted mid-stream
  • Included real example from test logs

Architecture Documentation (docs/source/development/architecture.rst)

  • Added comprehensive "Inject-and-Continue (Preempt-Not-Restart)" section
  • Visual comparison of traditional vs MassGen approach
  • Explained benefits: context preservation, efficiency, better collaboration
  • Documented safe-point injection mechanism and race condition

Core Concepts Documentation (docs/source/user_guide/concepts.rst)

  • Updated coordination flow diagram from "RESTART coordination" to "INJECT update to others"
  • Changed key insight from "Restart on new_answer" to "Inject-and-continue"
  • Clarified that agents receive updates mid-work and continue with preserved thinking

Technical Details

How It Works

Before (Restart Approach):

Agent A: Working on solution... [thinking deeply about approach X]
Agent B: ✅ Provides new answer
         ↓
Agent A: 🔁 RESTART - Kill stream, clear context, start fresh
         ❌ Lost all thinking about approach X

After (Inject-and-Continue):

Agent A: Working on solution... [thinking deeply about approach X]
Agent B: ✅ Provides new answer
         ↓
Agent A: 📨 UPDATE RECEIVED - Inject new context and continue
         ✅ Keeps all thinking about approach X
         ✅ Can now build on Agent B's answer

Thus the agent will never throw away its context; instead of being forced to restart, the agent will continue its stream with the new answer from the other agent in its context. Without this, we are wasting runs and potentially eliminating diversity as agents will want to converge earlier.

Safe-Point Injection

Updates are injected at safe points:

  • ✅ Between iteration loops (after completing a response)
  • ✅ When agent checks for new context
  • ❌ NOT mid-stream (would break agent reasoning)

Race Condition (Acceptable)

If an agent is deep in its first response when a new answer arrives:

  • Won't see injection until completing that response
  • By then, may already have full context from orchestrator
  • Agent still gets all answers, just via different mechanism
  • This is acceptable - same final outcome

Checklist

  • I have run pre-commit on my changed files and all checks pass
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Pre-commit status

# Pre-commit checks to be run before merging
uv run pre-commit run --files massgen/orchestrator.py massgen/coordination_tracker.py massgen/utils.py
uv run pre-commit run --files docs/dev_notes/preempt_not_restart_design.md
uv run pre-commit run --files docs/source/development/architecture.rst docs/source/user_guide/concepts.rst

How to Test

Test CLI Command

Prerequisites:

  • Configure test file with 3 agents (fast models recommended for quick testing)
  • Use test_preempt_not_restart.yaml as reference

Test command:

# Run with test config
uv run massgen --config test_preempt_not_restart.yaml "Create a simple website about dogs"

# Monitor logs for injection events
tail -f .massgen/massgen_logs/log_*/massgen.log | grep -i "inject"

Alternative test (watch coordination table in real-time):

# Start MassGen in one terminal
uv run massgen --config test_preempt_not_restart.yaml "Create a website about Bob Dylan"

# In another terminal, watch coordination events
watch -n 2 'find .massgen/massgen_logs -name "coordination_events.json" -exec tail -20 {} \;'

Expected Results

Console Output:

  • Agents show 📨 [agent_id] receiving update with new answers when injection happens
  • Coordination table logs should show UPDATE_INJECTED events
  • Agents should continue their reasoning without full restart

Log Verification:

# Check for successful injections
grep "Injecting update for" .massgen/massgen_logs/log_*/massgen.log

# Verify agents received NEW answers (not empty)
grep "NEW answers since agent started" .massgen/massgen_logs/log_*/massgen.log

# Count injection events
grep -c "UPDATE_INJECTED" .massgen/massgen_logs/log_*/attempt_1/coordination_events.json

Example successful injection log:

17:03:23 | INFO | [Orchestrator] Agent grok_agent started with 0 answer(s), now has 1 answer(s)
17:03:23 | INFO | [Orchestrator] NEW answers since agent started: ['gemini_agent']
17:03:23 | INFO | [Orchestrator] Injecting update for grok_agent

Workspace Update Message (agents with filesystem):

WORKSPACE UPDATE:
- Your workspace files are preserved
- New workspace snapshots available from 2 agent(s):
  - agent1's work: /path/to/temp_workspaces/gemini_agent/
  - agent3's work: /path/to/temp_workspaces/grok_agent/

What Success Looks Like

  1. ✅ Agents receive update injections mid-work (not always at start)
  2. ✅ Agents continue with preserved context
  3. ✅ Agents can reference new answers in their reasoning
  4. ✅ Coordination events show UPDATE_INJECTED type
  5. ✅ Workspace paths are specific to affected agents

Edge Cases Tested

  • Fast agent provides answer early: Other agents should get injection
  • Slow agent deep in first response: May get full context on restart (acceptable)
  • No new answers to inject: Flag cleared, agent proceeds normally
  • Multiple rapid updates: Each handled sequentially at safe points

Additional Context

Related Issues

Design Decisions

Why not interrupt mid-stream?

  • Would break agent reasoning and potentially corrupt responses
  • Safe-point injection ensures agent completes current thought
  • Race condition is acceptable - agent still gets all context

Why specific workspace paths?

  • Helps agents know exactly where to find new files
  • Reduces confusion about which workspaces have updates
  • Makes collaboration more explicit and debuggable

Future Improvements

  • Add metrics for injection effectiveness
  • Consider batching multiple rapid updates into single injection
  • Explore pre-emptive hints for agents about incoming updates

Breaking Changes

None. This is a pure enhancement - existing behavior remains for agents without active streams.

Migration Notes

No migration needed. Feature works automatically with existing configs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Capture agent in-progress summaries/memories during restart_pending to preserve diverse perspectives

2 participants