This repository contains the code for the paper *Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI*, developed for the NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA). The full paper can be found here.
It provides a causality-driven simulation environment for cloud microservices designed to benchmark reinforcement learning (RL) and multi-agentic (LLM-based) systems on constrained optimisation problems such as autoscaling.
Large Language Model (LLM)-based agents can reason and plan but often lack realistic domain models. This simulation environment bridges that gap by modelling causal relations between CPU usage, memory usage, resource limits, and latency in distributed microservice systems.
The environment enables researchers to:
- Evaluate RL and multi-agent systems in a controlled, reproducible setting.
- Simulate latency propagation through causal graphs.
- Inject system faults (e.g., CPU leaks, memory leaks, degradation events).
- Compare agents (LLMs, DQN, rule-based) across standardised tasks.
Key features:

- Causal latency modelling using a structural vector autoregressive (SVAR) process (see the sketch after this list).
- Gymnasium-compatible RL environment.
- Service-level fault injection:
  - CPU or memory leaks
  - Service degradation
  - Random load spikes
- Configurable scaling actions: vertical (CPU/memory) or horizontal (pods).
- Built-in benchmark tasks:
  - Easy – E-commerce Platform
  - Intermediate – Social Media Platform
  - Hard – Enterprise Financial Platform
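As a rough sketch of what the causal latency model looks like (this is the generic SVAR form, written from the feature description above; the exact structure and coefficients used by the simulator are given in the paper), the latency $\ell_{i,t}$ of service $i$ at step $t$ depends on its upstream parents and its own recent history:

```math
\ell_{i,t} = \sum_{j \in \mathrm{pa}(i)} a_{ij}\,\ell_{j,t} + \sum_{k=1}^{p} b_{ik}\,\ell_{i,t-k} + \varepsilon_{i,t}
```

Here $\mathrm{pa}(i)$ denotes the parent services of $i$ in the causal graph, the $a_{ij}$ capture contemporaneous influence along graph edges, the $b_{ik}$ capture autoregressive persistence, and $\varepsilon_{i,t}$ is a noise term (e.g., the truncated-exponential `lat_func` used in the quickstart below).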
```bash
git clone https://github.com/itbench-hub/ITBench-Exploratory-Simulator
cd ITBench-Exploratory-Simulator
poetry install
```

Requirements:

- Python ≥ 3.10
- `gymnasium`, `numpy`, `scipy`, `networkx`, `matplotlib`, `pydantic`, `seaborn`, `pandas`
- Optional: `langchain`, `langchain-community`, `langgraph`, `torch`, `ollama` for multi-agent experiments.
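To verify the installation (assuming the package is importable under the `l2sbench` name used throughout the examples below), run for instance via `poetry run python`:

```python
# Installation sanity check: the package name is taken from the imports
# used in the examples below.
import l2sbench  # noqa: F401

print("l2sbench imported successfully")
```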
For an in-depth tutorial, see the example Simulation Notebook. For a tutorial on how to inject failures, we refer to the FailureInjection Notebook and the ServiceDeficit Notebook.
A minimal example of initializing and running a two-service environment:
```python
from functools import partial

from scipy.stats import truncexpon

from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv

# Define two connected services
service1 = SimulatedService(
    name="service1",
    max_memory=2048,
    max_cpu=2.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_cpu_leak=True,
    cpu_leak_probability=0.01,
    leak_recovery_probability=0.001,
    cpu_leak_rate=0.01,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=40.0,
)

service2 = SimulatedService(
    name="service2",
    parent_services=[service1],
    max_memory=1028,
    max_cpu=1.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_memory_leak=True,
    memory_leak_probability=0.01,
    leak_recovery_probability=0.005,
    memory_leak_rate=5.0,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=50.0,
)

# Create environment
env = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=5.0,
    beta=1.0,
    max_steps=300,
)

obs, info = env.reset()
for i in range(100):
    action = env.action_space.sample()  # random action
    obs, reward, done, truncated, info = env.step(action)
    print(f"Step reward: {reward:.2f}")
```

| Parameter | Default | Description |
|---|---|---|
| **General Service Parameters** | | |
| `name` | — | Name of the service |
| `max_memory` | — | Max memory allocation (MB) |
| `max_cpu` | — | Max CPU allocation (cores) |
| `max_pods` | — | Max replicas per service |
| `parent_services` | `[]` | Upstream dependencies |
| `cpu_dependent` | `True` | Whether latency depends mainly on CPU |
| **Latency Parameters** | | |
| `pod_influence_decay` | 2 | Decay of pod impact on latency |
| `target_latency` | — | Target latency (ms) |
| **Workload/Spikes** | | |
| `enable_random_spikes` | `True` | Random request bursts |
| `spike_schedule` | `None` | Optional manual spike schedule |
| **Degradation** | | |
| `enable_degradation` | `False` | Enable degradation events |
| `degradation_probability` | 0.001 | Probability per step |
| `recovery_probability` | 0.01 | Recovery probability |
| `degradation_latency_penalty` | 50.0 | Extra latency (ms) when degraded |
| **Leaks** | | |
| `enable_cpu_leak` | `False` | Enable CPU leak |
| `cpu_leak_probability` | 0.001 | Probability per step |
| `cpu_leak_rate` | 0.002 | Leak growth rate |
| `enable_memory_leak` | `False` | Enable memory leak |
| `memory_leak_probability` | 0.001 | Probability per step |
| `memory_leak_rate` | 2.0 | Leak growth rate (MB/step) |
| `leak_recovery_probability` | 0.005 | Recovery probability |
| **Environment** | | |
| `alpha` | 1.0 | Reward weighting for latency violations (see the sketch after this table) |
| `beta` | 1.0 | Reward penalty for resource cost |
| `max_steps` | 500 | Simulation duration (steps) |
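The `alpha` and `beta` weights trade off SLO compliance against resource cost in the reward. A minimal sketch of how one might probe this trade-off, reusing `service1` and `service2` from the quickstart above (the specific weight values are illustrative, not recommendations):

```python
# Sketch: a latency-focused vs. a cost-focused environment.
# alpha/beta values are illustrative; reuses service1/service2 from the quickstart.
latency_focused = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=10.0,  # heavily penalise latency violations
    beta=0.1,    # make resource cost almost free
    max_steps=300,
)

cost_focused = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=1.0,   # tolerate occasional violations
    beta=5.0,    # make over-provisioning expensive
    max_steps=300,
)

for variant in (latency_focused, cost_focused):
    obs, info = variant.reset()
    total = 0.0
    for _ in range(100):
        obs, reward, done, truncated, info = variant.step(variant.action_space.sample())
        total += reward
        if done or truncated:
            break
    print(f"Cumulative random-policy reward: {total:.2f}")
```

In practice you would likely construct fresh `SimulatedService` instances per environment, since services may carry mutable state (leak and degradation status) across runs.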
| Challenge | Services (d) | Edges (V) | Target Latency | Description |
|---|---|---|---|---|
| Easy | 5 | 5 | 35 ms | Slightly underprovisioned e-commerce platform |
| Intermediate | 9 | 11 | 90 ms | Social media platform with CPU leak in user-service |
| Hard | 12 | 17 | 135 ms | Financial platform with degradation and CPU leak |
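The benchmark environments themselves ship with the repository (see the Notebooks folder). As a purely hypothetical illustration of what an Easy-style topology looks like when assembled with the public API from the quickstart (the service names, resource limits, and wiring below are made up and do not reproduce the shipped task):

```python
# Hypothetical 5-service, 5-edge e-commerce chain in the style of the Easy challenge.
# All names and limits are illustrative; the actual benchmark topology may differ.
from functools import partial

from scipy.stats import truncexpon

from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv

lat = partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5)

frontend = SimulatedService(name="frontend", max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)
catalog = SimulatedService(name="catalog", parent_services=[frontend], max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)
cart = SimulatedService(name="cart", parent_services=[frontend], max_memory=512, max_cpu=0.5, max_pods=3, lat_func=lat)
payment = SimulatedService(name="payment", parent_services=[cart], max_memory=512, max_cpu=0.5, max_pods=3, lat_func=lat)
checkout = SimulatedService(name="checkout", parent_services=[catalog, payment], max_memory=1024, max_cpu=1.0, max_pods=3, lat_func=lat)

env = SpikeMicroserviceEnv(
    services=[frontend, catalog, cart, payment, checkout],
    terminal_service="checkout",
    target_latency=35.0,  # matches the Easy target above
    max_steps=300,
)
```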
The repository includes reference implementations for:
- Master Policy (rule-based gold standard)
- Global DQN Agent
- LLM-Multi-Agent System (LangChain-based)
- Lazy Agent (no action baseline)
Each agent interacts with the environment through the Gymnasium interface:
```python
from l2sbench.Benchmarks.masteragent import Masteragent1, Masteragent2, Masteragent3  # one Masteragent per env
from l2sbench.Benchmarks.lazyagent import Lazyagent  # dummy no-action baseline
from l2sbench.Benchmarks.combinedsystem import CombinedScalingSystem  # LLM multi-agent system (needs to be trained)
from l2sbench.Benchmarks.dqnagent import DQNAgent  # global DQN agent (needs to be trained)

agent = Lazyagent()
obs, info = env.reset()
for i in range(5):
    action = agent.get_action(env, i)  # the call signature differs for each agent
    obs, reward, done, truncated, info = env.step(action)
```

These are the results reported in the paper for the experimental evaluation setup. Note that none of the algorithms aims to solve the challenges perfectly; they merely illustrate how the environment can be used to set up benchmark tasks. The code to rerun all experiments is located in the Notebooks folder; a sketch of how the metrics can be collected follows the table below.
| | Easy | | | Intermediate | | | Hard | | |
|---|---|---|---|---|---|---|---|---|---|
| **Method** | Reward | Violations | Actions | Reward | Violations | Actions | Reward | Violations | Actions |
| Master Policy | −5.93 ± 0.03 | 0 % | 2 | −14.18 ± 3.85 | 8 % | 25 | −242.23 ± 124.87 | 79 % | 28 |
| Global DQN | −6.63 ± 0.42 | 0 % | 100 | −14.94 ± 5.52 | 8 % | 100 | −298.68 ± 132.46 | 95 % | 100 |
| LLM-Multi-Agent | −6.01 ± 0.63 | 3 % | 3 | −15.31 ± 1.70 | 7 % | 77 | −262.63 ± 99.03 | 87 % | 100 |
| Lazy-Agent | −6.86 ± 3.71 | 23 % | 0 | −19.73 ± 6.94 | 57 % | 0 | −283.08 ± 120.98 | 85 % | 0 |
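A sketch of how metrics like these could be collected, reading the columns as mean episodic reward (± std), the share of steps that violate the latency target, and the number of scaling actions taken. This is not the repository's exact evaluation code; the bookkeeping below is an assumption for illustration, and it reuses the `env` built in the quickstart:

```python
import numpy as np

from l2sbench.Benchmarks.lazyagent import Lazyagent

# Sketch: evaluate an agent over several episodes on an existing env.
# Violation/action counting is omitted because how those are exposed depends
# on env/agent internals; only the reward statistics are computed here.
def evaluate(agent, env, episodes=10, horizon=100):
    episode_rewards = []
    for _ in range(episodes):
        obs, info = env.reset()
        total = 0.0
        for step in range(horizon):
            action = agent.get_action(env, step)
            obs, reward, done, truncated, info = env.step(action)
            total += reward
            if done or truncated:
                break
        episode_rewards.append(total)
    return np.mean(episode_rewards), np.std(episode_rewards)

mean_r, std_r = evaluate(Lazyagent(), env)
print(f"Reward: {mean_r:.2f} ± {std_r:.2f}")
```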
If you use this environment, please cite the following paper:
```bibtex
@inproceedings{lohse2025licence,
  title={Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI},
  author={Lohse, Christopher and Selk, Adrian and Ba, Amadou and Wahl, Jonas and Ruffini, Marco},
  booktitle={NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA)},
  year={2025}
}
```

This project is released under the Apache 2.0 License. See LICENSE for details.
