Collaborator

@acured acured commented Oct 23, 2025

Issue:
The AgentOps mock service and the OpenTelemetry batch processor shutdown are not synchronized:
🖇 AgentOps: Network error during span export: HTTPConnectionPool(host='localhost', port=41733): Max retries exceeded with url: /traces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7e0a521ffd10>: Failed to establish a new connection: [Errno 111] Connection refused'))

Bug: https://github.com/microsoft/agent-lightning/issues/51

Fix:

  1. Override the exporter used by the batch processor.
  2. Switch the AgentOps session from auto start to manual start.
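
For illustration, the first fix can be sketched as an exporter subclass that skips the network call while the service is disabled. This is a minimal sketch, not the PR's code: `BaseExporter` is a hypothetical stand-in for the real OTLP exporter, and only the names `_agentops_service_enabled` and `enable_agentops_service` come from the review discussion.

```python
from typing import List, Sequence

# Module-level flag, named after the one discussed in the review.
_agentops_service_enabled = False


def enable_agentops_service(enabled: bool) -> None:
    """Turn real export to the AgentOps service on or off."""
    global _agentops_service_enabled
    _agentops_service_enabled = enabled


class BaseExporter:
    """Hypothetical stand-in for an exporter that talks to a remote service."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def export(self, spans: Sequence[str]) -> str:
        self.sent.extend(spans)  # pretend this is the network call
        return "SUCCESS"


class BypassableExporter(BaseExporter):
    """Delegates to the parent only when the service is enabled."""

    def export(self, spans: Sequence[str]) -> str:
        if _agentops_service_enabled:
            return super().export(spans)
        # Assumption: reporting success keeps the batch processor quiet
        # when nothing is listening on the other end.
        return "SUCCESS"
```

With the flag off, no connection is attempted, so a processor flushing during shutdown has nothing to fail on.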

@acured acured force-pushed the hao/add_agentops_explorter_patch branch from 135849d to 8f52ca1 Compare October 24, 2025 01:30
@acured acured marked this pull request as ready for review October 24, 2025 03:02
return self.server_port


class SwitchableAuthenticatedOTLPExporter(AuthenticatedOTLPExporter):
Contributor

We might need to add tests for the new exporters, to make sure they do not crash or raise more warnings.

Collaborator Author

Added

Contributor

I think this should be something like Bypassable...Exporter (switchable is kind of confusing and Chinglish).

Collaborator Author

Fixed

ctx = self._lightning_span_processor.with_context(store=store, rollout_id=rollout_id, attempt_id=attempt_id)
with ctx as processor:
    kwargs: dict[str, Any] = {}
    if name is not None:
Contributor

Suggest wrapping end_trace in a finally block.

Also, I remember end_trace can report a job status. Is that supported now?

Collaborator Author

Done.
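
A minimal sketch of the suggested shape, for illustration only. `start_trace` and `end_trace` here are hypothetical stand-ins for the AgentOps calls, the status strings are placeholders for `StatusCode` values, and re-raising in the `except` branch is an assumption rather than something the thread confirms.

```python
from contextlib import contextmanager
from typing import Iterator, List, Tuple

ended: List[Tuple[str, str]] = []  # records (trace, end_state) for illustration


def start_trace(name: str) -> str:
    """Hypothetical stand-in for agentops.start_trace."""
    return f"trace:{name}"


def end_trace(trace: str, end_state: str) -> None:
    """Hypothetical stand-in for agentops.end_trace."""
    ended.append((trace, end_state))


@contextmanager
def trace_context(name: str) -> Iterator[str]:
    trace = start_trace(name)
    status = "OK"
    try:
        yield trace
    except Exception:
        status = "ERROR"
        raise  # assumption: let the caller still see the failure
    finally:
        # Runs on both success and failure, so the trace is always closed
        # and the final status is reported with it.
        end_trace(trace, end_state=status)
```

Because the status is passed through `end_state`, the same call covers both the "always end the trace" and the "report a job status" points raised above.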


  if not agentops.get_client().initialized:
-     agentops.init() # type: ignore
+     agentops.init(auto_start_session=False) # type: ignore
Contributor

Now the AgentOps mock server may not be needed any more. Suggest removing it in this PR.

Collaborator Author

Removed.

    yield processor
except:
    # Need logging or raise here?
    status = StatusCode.ERROR # type: ignore
Collaborator Author

Do we need logging or raise here?

Contributor

I'm not sure whether the tracer's trace_context can catch errors raised by the runner; that needs testing. If this except does catch something, it may be a critical error in the tracer itself.

Anyway, logging does no harm, especially logger.debug.

Collaborator Author

OK.

@acured acured closed this Oct 24, 2025
@acured acured reopened this Oct 24, 2025
_switch = False


def set_switch(value: bool):
Contributor

Rename it to something like bypass_agentops_service(enabled: bool).

Now that I look at it, switch is not a good name. Very confusing.

Collaborator Author

Done.

app = agentops_local_server()
app.run(**kwargs)

class SwitchableOTLPMetricExporter(OTLPMetricExporter):
Contributor

add docstrings for new classes

Collaborator Author

Done.

@acured acured closed this Oct 24, 2025
@acured acured reopened this Oct 24, 2025
finally:
agentops.end_trace(trace, end_state=status) # type: ignore
elif store is None and rollout_id is None and attempt_id is None:
with self._lightning_span_processor:
Contributor

so you don't care about this path?

Contributor

what do you think?

Collaborator Author

Missing, fixed.
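
The branch under discussion depends on the three arguments being either all set or all unset. A minimal sketch of that check follows; the helper name `has_rollout_context` is hypothetical, while the error message is the one that appears in the diff.

```python
from typing import Any, Optional


def has_rollout_context(
    store: Optional[Any],
    rollout_id: Optional[str],
    attempt_id: Optional[str],
) -> bool:
    """Return True if all three are provided, False if none are.

    A mix of provided and missing values is rejected.
    """
    provided = [value is not None for value in (store, rollout_id, attempt_id)]
    if all(provided):
        return True
    if not any(provided):
        return False
    raise ValueError("store, rollout_id, and attempt_id must be either all provided or all None")
```

Both valid outcomes need the same trace start and end handling, which is why the all-None path should not be left out.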

base_url = f"http://dummy"
env_vars_to_set = {
"AGENTOPS_API_KEY": "dummy",
"AGENTOPS_API_ENDPOINT": base_url,
Contributor

I think api endpoint and app url and exporter endpoint no longer need to be overwritten.

Contributor

What do you think?

Collaborator Author

@acured acured Oct 28, 2025

Removed, but we have to keep AGENTOPS_API_KEY to avoid an initialization failure in the AgentOps SDK.
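
One way to keep a placeholder key without clobbering a real one is sketched below. The helper name is hypothetical, and using `setdefault` is an assumption: the thread does not show whether the PR overwrites the variable or only fills it in when missing.

```python
import os


def ensure_agentops_api_key(placeholder: str = "dummy") -> str:
    """Set AGENTOPS_API_KEY to a placeholder only if it is not already set."""
    # setdefault leaves an existing value untouched and returns the value in effect.
    return os.environ.setdefault("AGENTOPS_API_KEY", placeholder)
```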

@ultmaster ultmaster marked this pull request as draft October 24, 2025 17:22
@ultmaster ultmaster marked this pull request as ready for review October 24, 2025 17:22
Copilot AI review requested due to automatic review settings October 24, 2025 17:22
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses synchronization issues between the AgentOps mock service and OTL batch processor shutdown by replacing the local mock server approach with a switchable exporter pattern that can toggle communication with the AgentOps service.

Key Changes:

  • Replaced the AgentOps local mock server implementation with switchable exporters/clients that can operate in local mode
  • Modified AgentOps initialization to use manual session start instead of auto-start
  • Added trace context management with proper status tracking and error handling

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/instrumentation/test_agentops.py | New test file validating switchable exporter behavior when toggling service communication on/off |
| agentlightning/tracer/agentops.py | Removed server manager, simplified worker initialization to use dummy endpoint, added trace lifecycle management with status tracking |
| agentlightning/instrumentation/agentops.py | Replaced mock server implementation with switchable exporters/clients that inherit from AgentOps classes and conditionally execute based on service enabled flag |

"""
global _agentops_service_enabled
_agentops_service_enabled = enabled
logger.info(f"Switch set to {value} for exporters and clients.")
Copilot AI Oct 24, 2025

The log message references an undefined variable value. It should reference enabled instead.

Suggested change
logger.info(f"Switch set to {value} for exporters and clients.")
logger.info(f"Switch set to {enabled} for exporters and clients.")

Collaborator Author

Fixed

logger.warning(f"AgentOps still not ready after {attempt} attempts. Retrying...")
def fetch_auth_token(self, *args: Any, **kwargs: Any) -> AuthTokenResponse:
if _agentops_service_enabled:
resp = super().post(*args, **kwargs)
Copilot AI Oct 24, 2025

Method calls super().post() but the parent class V3Client has a fetch_auth_token method, not a generic post method. This should call super().fetch_auth_token(*args, **kwargs) to properly delegate to the parent implementation.

Suggested change
resp = super().post(*args, **kwargs)
resp = super().fetch_auth_token(*args, **kwargs)

Collaborator Author

Fixed

f"[Worker {worker_id}] AgentOps managed, but local server port is not available. Client may not connect as expected."
)

base_url = f"http://dummy"
Copilot AI Oct 24, 2025

[nitpick] Hardcoded dummy URL should be defined as a module-level constant (e.g., DUMMY_AGENTOPS_ENDPOINT = "http://dummy") to improve maintainability and make it easier to locate if it needs to be changed.

Collaborator Author

Skip this improvement.

Contributor

I think it actually makes sense if we don't need to set API_ENDPOINT any more.

Collaborator Author

Right

SwitchableAuthenticatedOTLPExporter,
SwitchableOTLPMetricExporter,
SwitchableOTLPSpanExporter,
set_switch,
Copilot AI Oct 24, 2025

Import references set_switch function which does not exist in the module. Based on the implementation in agentops.py, this should be enable_agentops_service instead.

Collaborator Author

Fixed

with patch.object(
switchable_authenticated_exporter.__class__.__bases__[0], "export", return_value=SpanExportResult.SUCCESS
) as mock_export:
set_switch(True)
Copilot AI Oct 24, 2025

Function set_switch is called but does not exist in the imported module. This should call enable_agentops_service(True) based on the actual implementation.

Collaborator Author

Fixed
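
The test technique in the snippet above (patching `export` on the parent class and toggling the flag) can be sketched in a self-contained form. Everything here is a hypothetical stand-in: `ParentExporter`, `ToggleExporter`, and `set_service_enabled` are illustrative names, not the PR's classes or functions.

```python
from unittest.mock import patch

_service_enabled = False


def set_service_enabled(enabled: bool) -> None:
    """Hypothetical toggle, standing in for the PR's enable function."""
    global _service_enabled
    _service_enabled = enabled


class ParentExporter:
    """Hypothetical parent whose export would hit the network."""

    def export(self, spans):
        raise ConnectionError("no server listening")


class ToggleExporter(ParentExporter):
    def export(self, spans):
        if _service_enabled:
            return super().export(spans)
        return "SUCCESS"


def count_parent_calls():
    """Return (calls while enabled, calls while disabled) to the parent export."""
    exporter = ToggleExporter()
    # Patch the parent class so no real export is attempted.
    with patch.object(ParentExporter, "export", return_value="SUCCESS") as mock_export:
        set_service_enabled(True)
        exporter.export(["span"])
        enabled_calls = mock_export.call_count
        set_service_enabled(False)
        exporter.export(["span"])
        disabled_calls = mock_export.call_count - enabled_calls
    return enabled_calls, disabled_calls
```

Patching the parent by name, rather than through `__class__.__bases__[0]`, keeps the test readable if the inheritance chain changes.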

@acured acured force-pushed the hao/add_agentops_explorter_patch branch from 085c604 to 1543515 Compare October 27, 2025 01:27
@acured acured closed this Oct 27, 2025
@acured acured reopened this Oct 27, 2025
@acured acured closed this Oct 27, 2025
@acured acured reopened this Oct 27, 2025
@acured acured force-pushed the hao/add_agentops_explorter_patch branch from e7c6eba to 4a05d89 Compare October 28, 2025 01:51
@acured acured closed this Oct 28, 2025
@acured acured reopened this Oct 28, 2025
@ultmaster
Contributor

Please resolve the conflicts

@acured acured force-pushed the hao/add_agentops_explorter_patch branch from 4a05d89 to 03212cf Compare October 28, 2025 05:53
logger.warning("instrument_managed=False. You are responsible for all instrumentation.")

def __getstate__(self):
state = self.__dict__.copy()
Contributor

if that's removed, we don't need getstate and setstate any more.

Collaborator Author

Removed.

    yield processor
except Exception as e:
    status = StatusCode.ERROR # type: ignore
    logger.debug(f"Trace failed for rollout_id={rollout_id}, attempt_id={attempt_id}, error={e}")
Contributor

that should be logger.error at least

Collaborator Author

Yes changed.

@ultmaster
Contributor

/ci

@github-actions

github-actions bot commented Oct 28, 2025

🚀 CI Watcher for correlation id-3454803530-mha6v8ny triggered by comment 3454803530
🏃‍♀️ Tracking 6 workflow run(s):

✅ All runs completed.

@acured
Collaborator Author

acured commented Oct 28, 2025

> Please resolve the conflicts

Fixed.

else:
raise ValueError("store, rollout_id, and attempt_id must be either all provided or all None")
except Exception as e:
status = StatusCode.ERROR # type: ignore
Contributor

Have you tested whether the exception inside the trace_context will be caught here?

Collaborator Author

Yes. Unless the exception is handled (suppressed) in def __exit__(self, exc_type, exc, tb), it propagates and can be caught externally.

Contributor

please add a test case.

Collaborator Author

OK

Collaborator Author

Added
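
The propagation rule discussed in this thread is standard Python behavior and can be shown with two toy context managers. The class names are illustrative, and the status strings are placeholders for the `StatusCode` values used in the PR.

```python
class Passthrough:
    """Context manager whose __exit__ does not suppress exceptions."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # falsy return: the exception keeps propagating


class Suppressing:
    """Context manager whose __exit__ swallows exceptions."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return True  # truthy return: the exception stops here


def run_with(context_manager) -> str:
    """Mimic the outer try/except that records the trace status."""
    status = "OK"
    try:
        with context_manager:
            raise RuntimeError("rollout failed")
    except Exception:
        status = "ERROR"
    return status
```

Only a context manager that returns a truthy value from `__exit__` hides the failure from the outer `except`, which is the case a test should guard against.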

kwargs: dict[str, Any] = {}
if name is not None:
    kwargs["trace_name"] = str(name)
trace = agentops.start_trace(**kwargs)
Contributor

does start_trace have other kwargs?

Collaborator Author

Args:
    trace_name: Name for the trace (e.g., "session", "my_custom_task").
    tags: Optional tags to attach to the trace span (list of strings or dict).

maybe we can use: agentops.start_trace(trace_name=str(name))

Contributor

if name is None, we can use rollout_id?

Collaborator Author

OK

Collaborator Author

Added
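
The fallback agreed on above can be sketched as a small helper that builds the keyword arguments for `start_trace`. The helper name is hypothetical; `trace_name` is the argument quoted from the `start_trace` docstring earlier in this thread, and how the merged code actually implements the fallback is not shown here.

```python
from typing import Any, Dict, Optional


def build_start_trace_kwargs(name: Optional[str], rollout_id: Optional[str]) -> Dict[str, Any]:
    """Prefer the explicit name, fall back to rollout_id, else pass nothing."""
    kwargs: Dict[str, Any] = {}
    trace_name = name if name is not None else rollout_id
    if trace_name is not None:
        kwargs["trace_name"] = str(trace_name)
    return kwargs
```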

@ultmaster
Contributor

/ci

@github-actions

github-actions bot commented Oct 28, 2025

🚀 CI Watcher for correlation id-3456516543-mham3qlk triggered by comment 3456516543
🏃‍♀️ Tracking 6 workflow run(s):

✅ All runs completed.

@ultmaster
Contributor

@acured Please avoid force push. This will toss away the diff and make review difficult. When merging, we will squash. So the chaotic commit history on the feature branch does not matter anyway.

@acured
Collaborator Author

acured commented Oct 29, 2025

> @acured Please avoid force push. This will toss away the diff and make review difficult. When merging, we will squash. So the chaotic commit history on the feature branch does not matter anyway.

OK

@acured acured closed this Oct 31, 2025
@acured acured reopened this Oct 31, 2025
@acured acured closed this Oct 31, 2025
@acured acured reopened this Oct 31, 2025
@acured
Collaborator Author

acured commented Oct 31, 2025

/ci

@github-actions

github-actions bot commented Oct 31, 2025

🚀 CI Watcher for correlation id-3472335385-mhepb2dm triggered by comment 3472335385
🏃‍♀️ Tracking 6 workflow run(s):

✅ All runs completed.

@github-actions

github-actions bot commented Oct 31, 2025

🚀 CI Watcher for correlation id-3472679590-mhes6f6j triggered by comment 3472679590
🏃‍♀️ Tracking 6 workflow run(s):

✅ All runs completed.

@ultmaster ultmaster changed the title from "Add agentops client(explorter) patch" to "Replace AgentOps mock server with bypassable client" Oct 31, 2025
@ultmaster ultmaster merged commit 3f372ff into microsoft:main Oct 31, 2025
8 checks passed
totoluo pushed a commit to totoluo/agent-lightning that referenced this pull request Nov 14, 2025
---------

Co-authored-by: Hao Ni (CSI Interfusion Co Ltd) <[email protected]>