-
Notifications
You must be signed in to change notification settings - Fork 687
Replace AgentOps mock server with bypassable client #202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace AgentOps mock server with bypassable client #202
Conversation
135849d to
8f52ca1
Compare
| return self.server_port | ||
|
|
||
|
|
||
| class SwitchableAuthenticatedOTLPExporter(AuthenticatedOTLPExporter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to add tests for the new exporters, to make sure they do not crash or raise more warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be something like Bypassable...Exporter (switchable is kind of confusing and chinglish...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
agentlightning/tracer/agentops.py
Outdated
| ctx = self._lightning_span_processor.with_context(store=store, rollout_id=rollout_id, attempt_id=attempt_id) | ||
| with ctx as processor: | ||
| kwargs: dict[str, Any] = {} | ||
| if name is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest wrap "end_trace" in finally.
Also I remember end_trace can report a job status. Is that supported now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| if not agentops.get_client().initialized: | ||
| agentops.init() # type: ignore | ||
| agentops.init(auto_start_session=False) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now the agnetops mock server may not be needed any more. Suggest removing them in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
agentlightning/tracer/agentops.py
Outdated
| yield processor | ||
| except: | ||
| # Need logging or raise here? | ||
| status = StatusCode.ERROR # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need logging or raise here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure tracer trace_context can catch errors from runner. Need to test it. Maybe when this except catches something, it's a critical error related to tracer.
Anyway logging doesn't harm. especially logger.debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
| _switch = False | ||
|
|
||
|
|
||
| def set_switch(value: bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it as something like bypass_agentops_service(enabled: bool)
Now I look at it, switch is not a good name. Very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| app = agentops_local_server() | ||
| app.run(**kwargs) | ||
|
|
||
| class SwitchableOTLPMetricExporter(OTLPMetricExporter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add docstrings for new classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return self.server_port | ||
|
|
||
|
|
||
| class SwitchableAuthenticatedOTLPExporter(AuthenticatedOTLPExporter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be something like Bypassable...Exporter (switchable is kind of confusing and chinglish...
agentlightning/tracer/agentops.py
Outdated
| finally: | ||
| agentops.end_trace(trace, end_state=status) # type: ignore | ||
| elif store is None and rollout_id is None and attempt_id is None: | ||
| with self._lightning_span_processor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you don't care about this path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing, fixed.
agentlightning/tracer/agentops.py
Outdated
| base_url = f"http://dummy" | ||
| env_vars_to_set = { | ||
| "AGENTOPS_API_KEY": "dummy", | ||
| "AGENTOPS_API_ENDPOINT": base_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think api endpoint and app url and exporter endpoint no longer need to be overwritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, but we have to keep "AGENTOPS_API_KEY" avoid initialization failed in agentops sdk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses synchronization issues between the AgentOps mock service and OTL batch processor shutdown by replacing the local mock server approach with a switchable exporter pattern that can toggle communication with the AgentOps service.
Key Changes:
- Replaced the AgentOps local mock server implementation with switchable exporters/clients that can operate in local mode
- Modified AgentOps initialization to use manual session start instead of auto-start
- Added trace context management with proper status tracking and error handling
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/instrumentation/test_agentops.py | New test file validating switchable exporter behavior when toggling service communication on/off |
| agentlightning/tracer/agentops.py | Removed server manager, simplified worker initialization to use dummy endpoint, added trace lifecycle management with status tracking |
| agentlightning/instrumentation/agentops.py | Replaced mock server implementation with switchable exporters/clients that inherit from AgentOps classes and conditionally execute based on service enabled flag |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """ | ||
| global _agentops_service_enabled | ||
| _agentops_service_enabled = enabled | ||
| logger.info(f"Switch set to {value} for exporters and clients.") |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log message references an undefined variable value. It should reference enabled instead.
| logger.info(f"Switch set to {value} for exporters and clients.") | |
| logger.info(f"Switch set to {enabled} for exporters and clients.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| logger.warning(f"AgentOps still not ready after {attempt} attempts. Retrying...") | ||
| def fetch_auth_token(self, *args: Any, **kwargs: Any) -> AuthTokenResponse: | ||
| if _agentops_service_enabled: | ||
| resp = super().post(*args, **kwargs) |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method calls super().post() but the parent class V3Client has a fetch_auth_token method, not a generic post method. This should call super().fetch_auth_token(*args, **kwargs) to properly delegate to the parent implementation.
| resp = super().post(*args, **kwargs) | |
| resp = super().fetch_auth_token(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
agentlightning/tracer/agentops.py
Outdated
| f"[Worker {worker_id}] AgentOps managed, but local server port is not available. Client may not connect as expected." | ||
| ) | ||
|
|
||
| base_url = f"http://dummy" |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Hardcoded dummy URL should be defined as a module-level constant (e.g., DUMMY_AGENTOPS_ENDPOINT = \"http://dummy\") to improve maintainability and make it easier to locate if it needs to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip this improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it actually makes sense if we don't need to set API_ENDPOINT any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right
| SwitchableAuthenticatedOTLPExporter, | ||
| SwitchableOTLPMetricExporter, | ||
| SwitchableOTLPSpanExporter, | ||
| set_switch, |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import references set_switch function which does not exist in the module. Based on the implementation in agentops.py, this should be enable_agentops_service instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| with patch.object( | ||
| switchable_authenticated_exporter.__class__.__bases__[0], "export", return_value=SpanExportResult.SUCCESS | ||
| ) as mock_export: | ||
| set_switch(True) |
Copilot
AI
Oct 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function set_switch is called but does not exist in the imported module. This should call enable_agentops_service(True) based on the actual implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
085c604 to
1543515
Compare
e7c6eba to
4a05d89
Compare
|
Please resolve the conflicts |
4a05d89 to
03212cf
Compare
agentlightning/tracer/agentops.py
Outdated
| logger.warning("instrument_managed=False. You are responsible for all instrumentation.") | ||
|
|
||
| def __getstate__(self): | ||
| state = self.__dict__.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if that's removed, we don't need getstate and setstate any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
agentlightning/tracer/agentops.py
Outdated
| base_url = f"http://dummy" | ||
| env_vars_to_set = { | ||
| "AGENTOPS_API_KEY": "dummy", | ||
| "AGENTOPS_API_ENDPOINT": base_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
agentlightning/tracer/agentops.py
Outdated
| f"[Worker {worker_id}] AgentOps managed, but local server port is not available. Client may not connect as expected." | ||
| ) | ||
|
|
||
| base_url = f"http://dummy" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it actually makes sense if we don't need to set API_ENDPOINT any more.
agentlightning/tracer/agentops.py
Outdated
| yield processor | ||
| except Exception as e: | ||
| status = StatusCode.ERROR # type: ignore | ||
| logger.debug(f"Trace failed for rollout_id={rollout_id}, attempt_id={attempt_id}, error={e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that should be logger.error at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes changed.
agentlightning/tracer/agentops.py
Outdated
| finally: | ||
| agentops.end_trace(trace, end_state=status) # type: ignore | ||
| elif store is None and rollout_id is None and attempt_id is None: | ||
| with self._lightning_span_processor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think?
|
/ci |
|
🚀 CI Watcher for correlation id-3454803530-mha6v8ny triggered by comment 3454803530
✅ All runs completed. |
Fixed. |
9b2efe2 to
a306fc2
Compare
| else: | ||
| raise ValueError("store, rollout_id, and attempt_id must be either all provided or all None") | ||
| except Exception as e: | ||
| status = StatusCode.ERROR # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested whether the exception inside the trace_context will be caught here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unless an exception is handled in def _exit (_, exce_type, exce, tb), it can be caught externally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| kwargs: dict[str, Any] = {} | ||
| if name is not None: | ||
| kwargs["trace_name"] = str(name) | ||
| trace = agentops.start_trace(**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does start_trace have other kwargs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Args:
trace_name: Name for the trace (e.g., "session", "my_custom_task").
tags: Optional tags to attach to the trace span (list of strings or dict).
maybe we can use: agentops.start_trace(trace_name=str(name))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if name is None, we can use rollout_id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
|
/ci |
|
🚀 CI Watcher for correlation id-3456516543-mham3qlk triggered by comment 3456516543
✅ All runs completed. |
|
@acured Please avoid force push. This will toss away the diff and make review difficult. When merging, we will squash. So the chaotic commit history on the feature branch does not matter anyway. |
OK |
|
/ci |
|
🚀 CI Watcher for correlation id-3472335385-mhepb2dm triggered by comment 3472335385
✅ All runs completed. |
|
🚀 CI Watcher for correlation id-3472679590-mhes6f6j triggered by comment 3472679590
✅ All runs completed. |
--------- Co-authored-by: Hao Ni (CSI Interfusion Co Ltd) <[email protected]>
Issue:
agentops mock service and OTL batchprocessor shutdown are not synchronized:
🖇 AgentOps: Network error during span export: HTTPConnectionPool(host='localhost', port=41733): Max retries exceeded with url: /traces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7e0a521ffd10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Bug: https://github.com/microsoft/agent-lightning/issues/51
Fixed: