This repository contains the code for the paper *Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI*, developed for the NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA). The full paper can be found here.
It provides a causality-driven simulation environment for cloud microservices designed to benchmark reinforcement learning (RL) and multi-agentic (LLM-based) systems on constrained optimisation problems such as autoscaling.
Large Language Model (LLM)-based agents can reason and plan but often lack realistic domain models. This simulation environment bridges that gap by modelling causal relations between CPU usage, memory usage, resource limits, and latency in distributed microservice systems.
The environment enables researchers to:
- Evaluate RL and multi-agent systems in a controlled, reproducible setting.
- Simulate latency propagation through causal graphs.
- Inject system faults (e.g., CPU leaks, memory leaks, degradation events).
- Compare agents (LLMs, DQN, rule-based) across standardised tasks.
Key features:

- Causal latency modelling using a structural vector autoregressive (SVAR) process (see the sketch after this list).
- Gymnasium-compatible RL environment.
- Service-level fault injection:
  - CPU or memory leaks
  - Service degradation
  - Random load spikes
- Configurable scaling actions: vertical (CPU/memory) or horizontal (pods).
- Built-in benchmark tasks:
  - Easy – E-commerce Platform
  - Intermediate – Social Media Platform
  - Hard – Enterprise Financial Platform
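As a rough sketch of what the causal latency model looks like (this is the generic SVAR form, written from the feature description above; the exact structure and coefficients used by the simulator are given in the paper), the latency $\ell_{i,t}$ of service $i$ at step $t$ depends on its upstream parents and its own recent history:

```math
\ell_{i,t} = \sum_{j \in \mathrm{pa}(i)} a_{ij}\,\ell_{j,t} + \sum_{k=1}^{p} b_{ik}\,\ell_{i,t-k} + \varepsilon_{i,t}
```

Here $\mathrm{pa}(i)$ denotes the parent services of $i$ in the causal graph, the $a_{ij}$ capture contemporaneous influence along graph edges, the $b_{ik}$ capture autoregressive persistence, and $\varepsilon_{i,t}$ is a noise term (e.g., the truncated-exponential `lat_func` used in the quickstart below).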
```bash
git clone https://github.com/itbench-hub/ITBench-Exploratory-Simulator
cd ITBench-Exploratory-Simulator
poetry install
```

Requirements:

- Python ≥ 3.10
- `gymnasium`, `numpy`, `scipy`, `networkx`, `matplotlib`, `pydantic`, `seaborn`, `pandas`
- Optional: `langchain`, `langchain-community`, `langgraph`, `torch`, `ollama` for multi-agent experiments.
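To verify the installation (assuming the package is importable under the `l2sbench` name used throughout the examples below), run for instance via `poetry run python`:

```python
# Installation sanity check: the package name is taken from the imports
# used in the examples below.
import l2sbench  # noqa: F401

print("l2sbench imported successfully")
```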
For an in-depth tutorial, see the example Simulation Notebook. For a tutorial on how to inject failures, we refer to the FailureInjection Notebook and the ServiceDeficit Notebook.
A minimal example of initializing and running a two-service environment:
```python
from functools import partial

from scipy.stats import truncexpon

from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv

# Define two connected services
service1 = SimulatedService(
    name="service1",
    max_memory=2048,
    max_cpu=2.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_cpu_leak=True,
    cpu_leak_probability=0.01,
    leak_recovery_probability=0.001,
    cpu_leak_rate=0.01,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=40.0,
)

service2 = SimulatedService(
    name="service2",
    parent_services=[service1],
    max_memory=1028,
    max_cpu=1.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_memory_leak=True,
    memory_leak_probability=0.01,
    leak_recovery_probability=0.005,
    memory_leak_rate=5.0,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=50.0,
)

# Create environment
env = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=5.0,
    beta=1.0,
    max_steps=300,
)

obs, info = env.reset()
for i in range(100):
    action = env.action_space.sample()  # random action
    obs, reward, done, truncated, info = env.step(action)
    print(f"Step reward: {reward:.2f}")
```

| Parameter | Default | Description |
|---|---|---|
| **General Service Parameters** | | |
| `name` | — | Name of the service |
| `max_memory` | — | Max memory allocation (MB) |
| `max_cpu` | — | Max CPU allocation (cores) |
| `max_pods` | — | Max replicas per service |
| `parent_services` | `[]` | Upstream dependencies |
| `cpu_dependent` | `True` | Whether latency depends mainly on CPU |
| **Latency Parameters** | | |
| `pod_influence_decay` | 2 | Decay of pod impact on latency |
| `target_latency` | — | Target latency (ms) |
| **Workload/Spikes** | | |
| `enable_random_spikes` | `True` | Random request bursts |
| `spike_schedule` | `None` | Optional manual spike schedule |
| **Degradation** | | |
| `enable_degradation` | `False` | Enable degradation events |
| `degradation_probability` | 0.001 | Probability per step |
| `recovery_probability` | 0.01 | Recovery probability |
| `degradation_latency_penalty` | 50.0 | Extra latency (ms) when degraded |
| **Leaks** | | |
| `enable_cpu_leak` | `False` | Enable CPU leak |
| `cpu_leak_probability` | 0.001 | Probability per step |
| `cpu_leak_rate` | 0.002 | Leak growth rate |
| `enable_memory_leak` | `False` | Enable memory leak |
| `memory_leak_probability` | 0.001 | Probability per step |
| `memory_leak_rate` | 2.0 | Leak growth rate (MB/step) |
| `leak_recovery_probability` | 0.005 | Recovery probability |
| **Environment** | | |
| `alpha` | 1.0 | Reward weighting for latency violations (see the sketch after this table) |
| `beta` | 1.0 | Reward penalty for resource cost |
| `max_steps` | 500 | Simulation duration (steps) |
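The `alpha` and `beta` weights trade off SLO compliance against resource cost in the reward. A minimal sketch of how one might probe this trade-off, reusing `service1` and `service2` from the quickstart above (the specific weight values are illustrative, not recommendations):

```python
# Sketch: a latency-focused vs. a cost-focused environment.
# alpha/beta values are illustrative; reuses service1/service2 from the quickstart.
latency_focused = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=10.0,  # heavily penalise latency violations
    beta=0.1,    # make resource cost almost free
    max_steps=300,
)

cost_focused = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=1.0,   # tolerate occasional violations
    beta=5.0,    # make over-provisioning expensive
    max_steps=300,
)

for variant in (latency_focused, cost_focused):
    obs, info = variant.reset()
    total = 0.0
    for _ in range(100):
        obs, reward, done, truncated, info = variant.step(variant.action_space.sample())
        total += reward
        if done or truncated:
            break
    print(f"Cumulative random-policy reward: {total:.2f}")
```

In practice you would likely construct fresh `SimulatedService` instances per environment, since services may carry mutable state (leak and degradation status) across runs.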
| Challenge | Services (d) | Edges (V) | Target Latency | Description |
|---|---|---|---|---|
| Easy | 5 | 5 | 35 ms | Slightly underprovisioned e-commerce platform |
| Intermediate | 9 | 11 | 90 ms | Social media platform with CPU leak in user-service |
| Hard | 12 | 17 | 135 ms | Financial platform with degradation and CPU leak |
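The benchmark environments themselves ship with the repository (see the Notebooks folder). As a purely hypothetical illustration of what an Easy-style topology looks like when assembled with the public API from the quickstart (the service names, resource limits, and wiring below are made up and do not reproduce the shipped task):

```python
# Hypothetical 5-service, 5-edge e-commerce chain in the style of the Easy challenge.
# All names and limits are illustrative; the actual benchmark topology may differ.
from functools import partial

from scipy.stats import truncexpon

from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv

lat = partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5)

frontend = SimulatedService(name="frontend", max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)
catalog = SimulatedService(name="catalog", parent_services=[frontend], max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)
cart = SimulatedService(name="cart", parent_services=[frontend], max_memory=512, max_cpu=0.5, max_pods=3, lat_func=lat)
payment = SimulatedService(name="payment", parent_services=[cart], max_memory=512, max_cpu=0.5, max_pods=3, lat_func=lat)
checkout = SimulatedService(name="checkout", parent_services=[catalog, payment], max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)

env = SpikeMicroserviceEnv(
    services=[frontend, catalog, cart, payment, checkout],
    terminal_service="checkout",
    target_latency=35.0,  # matches the Easy target above
    max_steps=300,
)
```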
The repository includes reference implementations for:
- Master Policy (rule-based gold standard)
- Global DQN Agent
- LLM-Multi-Agent System (LangChain-based)
- Lazy Agent (no action baseline)
Each agent interacts with the environment through the Gymnasium interface:
```python
from l2sbench.Benchmarks.masteragent import Masteragent1, Masteragent2, Masteragent3  # one Masteragent per env
from l2sbench.Benchmarks.lazyagent import Lazyagent  # dummy no-action baseline
from l2sbench.Benchmarks.combinedsystem import CombinedScalingSystem  # LLM multi-agent system (needs to be trained)
from l2sbench.Benchmarks.dqnagent import DQNAgent  # global DQN agent (needs to be trained)

agent = Lazyagent()
obs, info = env.reset()
for i in range(5):
    action = agent.get_action(env, i)  # the call signature differs for each agent
    obs, reward, done, truncated, info = env.step(action)
```

These are the results reported in the paper for the experimental evaluation setup. Note that none of the algorithms aims to solve the challenges perfectly; they merely illustrate how the environment can be used to set up benchmark tasks. The code to rerun all experiments is located in the Notebooks folder; a sketch of how the metrics can be collected follows the table below.
| | Easy | | | Intermediate | | | Hard | | |
|---|---|---|---|---|---|---|---|---|---|
| **Method** | Reward | Violations | Actions | Reward | Violations | Actions | Reward | Violations | Actions |
| Master Policy | −5.93 ± 0.03 | 0 % | 2 | −14.18 ± 3.85 | 8 % | 25 | −242.23 ± 124.87 | 79 % | 28 |
| Global DQN | −6.63 ± 0.42 | 0 % | 100 | −14.94 ± 5.52 | 8 % | 100 | −298.68 ± 132.46 | 95 % | 100 |
| LLM-Multi-Agent | −6.01 ± 0.63 | 3 % | 3 | −15.31 ± 1.70 | 7 % | 77 | −262.63 ± 99.03 | 87 % | 100 |
| Lazy-Agent | −6.86 ± 3.71 | 23 % | 0 | −19.73 ± 6.94 | 57 % | 0 | −283.08 ± 120.98 | 85 % | 0 |
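A sketch of how metrics like these could be collected, reading the columns as mean episodic reward (± std), the share of steps that violate the latency target, and the number of scaling actions taken. This is not the repository's exact evaluation code; the bookkeeping below is an assumption for illustration, and it reuses the `env` built in the quickstart:

```python
import numpy as np

from l2sbench.Benchmarks.lazyagent import Lazyagent

# Sketch: evaluate an agent over several episodes on an existing env.
# Violation/action counting is omitted because how those are exposed depends
# on env/agent internals; only the reward statistics are computed here.
def evaluate(agent, env, episodes=10, horizon=100):
    episode_rewards = []
    for _ in range(episodes):
        obs, info = env.reset()
        total = 0.0
        for step in range(horizon):
            action = agent.get_action(env, step)
            obs, reward, done, truncated, info = env.step(action)
            total += reward
            if done or truncated:
                break
        episode_rewards.append(total)
    return np.mean(episode_rewards), np.std(episode_rewards)

mean_r, std_r = evaluate(Lazyagent(), env)
print(f"Reward: {mean_r:.2f} ± {std_r:.2f}")
```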
If you use this environment, please cite the following paper:
```bibtex
@inproceedings{lohse2025licence,
  title={Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI},
  author={Lohse, Christopher and Selk, Adrian and Ba, Amadou and Wahl, Jonas and Ruffini, Marco},
  booktitle={NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA)},
  year={2025}
}
```

This project is released under the Apache 2.0 License. See LICENSE for details.
