
L2SBench: A Microservice Simulation Environment for Benchmarking Agentic AI


This repository contains the code for the paper Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI, developed for the NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA). The full paper can be found here.

It provides a causality-driven simulation environment for cloud microservices designed to benchmark reinforcement learning (RL) and multi-agentic (LLM-based) systems on constrained optimisation problems such as autoscaling.


Overview

Large Language Model (LLM)-based agents can reason and plan but often lack realistic domain models. This simulation environment bridges that gap by modelling causal relations between CPU usage, memory usage, resource limits, and latency in distributed microservice systems.

The environment enables researchers to:

  • Evaluate RL and multi-agent systems in a controlled, reproducible setting.
  • Simulate latency propagation through causal graphs.
  • Inject system faults (e.g., CPU leaks, memory leaks, degradation events).
  • Compare agents (LLMs, DQN, rule-based) across standardised tasks.

Features

  • Causal latency modelling using a structural vector autoregressive process (see the sketch at the end of this section).

  • Gymnasium-compatible RL environment.

  • Service-level fault injection:

    • CPU or memory leaks
    • Service degradation
    • Random load spikes
  • Configurable scaling actions: vertical (CPU/memory) or horizontal (pods).

  • Built-in benchmark tasks:

    • Easy – E-commerce Platform
    • Intermediate – Social Media Platform
    • Hard – Enterprise Financial Platform
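
To make the causal latency model concrete, here is a minimal sketch of latency propagation under a structural vector autoregressive (SVAR) process. It is illustrative only: the matrices A and B, the noise term, and the solve step are assumptions, not the library's exact implementation.

import numpy as np

# Minimal SVAR sketch (illustrative; A, B, and the noise model are assumptions):
# l_t = A @ l_t + B @ l_{t-1} + eps_t, solved as l_t = (I - A)^{-1} (B @ l_{t-1} + eps_t),
# where A encodes contemporaneous parent -> child latency effects along the service graph.

rng = np.random.default_rng(0)
n = 3  # three services in a chain: 0 -> 1 -> 2

A = np.zeros((n, n))
A[1, 0] = 0.6  # service 1 inherits part of service 0's latency
A[2, 1] = 0.8  # service 2 inherits part of service 1's latency

B = np.diag([0.3, 0.3, 0.3])  # autoregressive carry-over from the previous step

lat = np.zeros(n)
for t in range(5):
    eps = rng.exponential(scale=2.0, size=n)  # per-service base latency noise
    lat = np.linalg.solve(np.eye(n) - A, B @ lat + eps)  # propagate through the graph
    print(f"t={t}: " + ", ".join(f"{x:.1f} ms" for x in lat))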

Installation

git clone https://github.com/itbench-hub/ITBench-Exploratory-Simulator
cd ITBench-Exploratory-Simulator
poetry install

Requirements

  • Python ≥ 3.10
  • gymnasium, numpy, scipy, networkx, matplotlib, pydantic, seaborn, pandas
  • Optional: langchain, langchain-community, langgraph, torch, ollama for multi-agent experiments.

Quick Start Example

For an in-depth tutorial, see the exampleSimulation Notebook. For a tutorial on how to inject failures, we refer to the FailureInjection Notebook and the ServiceDeficit Notebook.

A minimal example of initializing and running a two-service environment:

from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv
from functools import partial
from scipy.stats import truncexpon

# Define two connected services
service1 = SimulatedService(
    name="service1",
    max_memory=2048,
    max_cpu=2.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_cpu_leak=True,
    cpu_leak_probability=0.01,
    leak_recovery_probability=0.001,
    cpu_leak_rate=0.01,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=40.0
)

service2 = SimulatedService(
    name="service2",
    parent_services=[service1],
    max_memory=1028,
    max_cpu=1.0,
    max_pods=3,
    lat_func=partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5),
    enable_memory_leak=True,
    memory_leak_probability=0.01,
    leak_recovery_probability=0.005,
    memory_leak_rate=5.0,
    enable_degradation=True,
    degradation_probability=0.005,
    recovery_probability=0.02,
    degradation_latency_penalty=50.0
)

# Create environment
env = SpikeMicroserviceEnv(
    services=[service1, service2],
    terminal_service="service2",
    target_latency=20.0,
    alpha=5.0,
    beta=1.0,
    max_steps=300
)

obs, info = env.reset()
for i in range(100):
    action = env.action_space.sample()  # random action
    obs, reward, done, truncated, info = env.step(action)
    print(f"Step reward: {reward:.2f}")

Environment Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| **General Service Parameters** | | |
| name | – | Name of the service |
| max_memory | – | Max memory allocation (MB) |
| max_cpu | – | Max CPU allocation (cores) |
| max_pods | – | Max replicas per service |
| parent_services | [] | Upstream dependencies |
| cpu_dependent | True | Whether latency depends mainly on CPU |
| **Latency Parameters** | | |
| pod_influence_decay | 2 | Decay of pod impact on latency |
| target_latency | – | Target latency (ms) |
| **Workload/Spikes** | | |
| enable_random_spikes | True | Random request bursts |
| spike_schedule | None | Optional manual spike schedule |
| **Degradation** | | |
| enable_degradation | False | Enable degradation events |
| degradation_probability | 0.001 | Probability per step |
| recovery_probability | 0.01 | Recovery probability |
| degradation_latency_penalty | 50.0 | Extra latency (ms) when degraded |
| **Leaks** | | |
| enable_cpu_leak | False | Enable CPU leak |
| cpu_leak_probability | 0.001 | Probability per step |
| cpu_leak_rate | 0.002 | Leak growth rate |
| enable_memory_leak | False | Enable memory leak |
| memory_leak_probability | 0.001 | Probability per step |
| memory_leak_rate | 2.0 | Leak growth rate (MB/step) |
| leak_recovery_probability | 0.005 | Recovery probability |
| **Environment** | | |
| alpha | 1.0 | Reward weighting for latency violations |
| beta | 1.0 | Reward penalty for resource cost |
| max_steps | 500 | Simulation duration |
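
The alpha and beta weights trade latency violations off against resource cost. As a hedged illustration of that trade-off (the environment's actual reward formula is not reproduced here; this shape is an assumption based on the parameter descriptions above):

# Illustrative reward shape only; the environment's real formula may differ.
def sketch_reward(latency, target_latency, resource_cost, alpha=1.0, beta=1.0):
    violation = max(0.0, latency - target_latency)  # penalise latency above the target
    return -(alpha * violation + beta * resource_cost)

# With alpha=5.0 (as in the Quick Start), a 5 ms violation dominates the cost term:
print(sketch_reward(latency=25.0, target_latency=20.0, resource_cost=1.5, alpha=5.0, beta=1.0))  # -26.5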

Benchmark Scenarios

| Challenge | Services (d) | Edges (V) | Target Latency | Description |
| --- | --- | --- | --- | --- |
| Easy | 5 | 5 | 35 ms | Slightly underprovisioned e-commerce platform |
| Intermediate | 9 | 11 | 90 ms | Social media platform with CPU leak in user-service |
| Hard | 12 | 17 | 135 ms | Financial platform with degradation and CPU leak |
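
The shipped benchmark environments define their own topologies and fault configurations. As a rough sketch of how a small scenario in the spirit of the Easy task could be assembled from the API shown in the Quick Start (the service names and parameters below are illustrative assumptions, not the actual Easy environment):

from functools import partial
from scipy.stats import truncexpon
from l2sbench.Simulation.simulatedservice import SimulatedService
from l2sbench.Simulation.simulatedenv import SpikeMicroserviceEnv

lat = partial(truncexpon.rvs, loc=0, size=1, b=1, scale=0.5)

# Five services in a simple chain: frontend -> cart -> checkout -> payment -> db.
services = []
for name in ["frontend", "cart", "checkout", "payment", "db"]:
    parents = [services[-1]] if services else []
    services.append(SimulatedService(name=name, parent_services=parents,
                                     max_memory=1024, max_cpu=1.0, max_pods=3,
                                     lat_func=lat))

env = SpikeMicroserviceEnv(services=services, terminal_service="db",
                           target_latency=35.0, alpha=5.0, beta=1.0, max_steps=300)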

Agents

The repository includes reference implementations for:

  • Master Policy (rule-based gold standard)
  • Global DQN Agent
  • LLM-Multi-Agent System (LangChain-based)
  • Lazy Agent (no action baseline)

Each agent interacts with the environment through the Gymnasium interface:

from l2sbench.Benchmarks.masteragent import Masteragent1, Masteragent2, Masteragent3  # one master agent per environment
from l2sbench.Benchmarks.lazyagent import Lazyagent  # dummy agent
from l2sbench.Benchmarks.combinedsystem import CombinedScalingSystem  # LLM multi-agent system; needs to be trained
from l2sbench.Benchmarks.dqnagent import DQNAgent  # global DQN agent; needs to be trained

agent = Lazyagent()
obs, info = env.reset()
for i in range(5):
    action = agent.get_action(env, i)  # the call differs for each agent
    obs, reward, done, truncated, info = env.step(action)
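
Below is a minimal sketch for comparing agents by cumulative episode reward, assuming env is one of the benchmark environments and each agent exposes get_action(env, step) as above (the aggregation here is an illustrative assumption, not the paper's evaluation harness):

def run_episode(env, agent, max_steps=300):
    # Roll out one episode and return the cumulative reward.
    obs, info = env.reset()
    total = 0.0
    for i in range(max_steps):
        action = agent.get_action(env, i)
        obs, reward, done, truncated, info = env.step(action)
        total += reward
        if done or truncated:
            break
    return total

rewards = [run_episode(env, Lazyagent()) for _ in range(10)]
print(f"Lazyagent mean reward over 10 episodes: {sum(rewards) / len(rewards):.2f}")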

Experimental Results

These are the results reported in the paper for the experimental evaluation setup. Please note that none of the algorithms aims to solve the challenges perfectly; they merely illustrate how the environment can be used to set up benchmark tasks. The code to rerun all experiments is located in the Notebooks folder.

| Method | Reward (Easy) | Violations (Easy) | Actions (Easy) | Reward (Intermediate) | Violations (Intermediate) | Actions (Intermediate) | Reward (Hard) | Violations (Hard) | Actions (Hard) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Master Policy | −5.93 ± 0.03 | 0 % | 2 | −14.18 ± 3.85 | 8 % | 25 | −242.23 ± 124.87 | 79 % | 28 |
| Global DQN | −6.63 ± 0.42 | 0 % | 100 | −14.94 ± 5.52 | 8 % | 100 | −298.68 ± 132.46 | 95 % | 100 |
| LLM-Multi-Agent | −6.01 ± 0.63 | 3 % | 3 | −15.31 ± 1.70 | 7 % | 77 | −262.63 ± 99.03 | 87 % | 100 |
| Lazy-Agent | −6.86 ± 3.71 | 23 % | 0 | −19.73 ± 6.94 | 57 % | 0 | −283.08 ± 120.98 | 85 % | 0 |

Citing

If you use this environment, please cite the following paper:

@inproceedings{lohse2025licence,
  title={Licence to Scale: A Microservice Simulation Environment for Benchmarking Agentic AI},
  author={Lohse, Christopher and Selk, Adrian and Ba, Amadou and Wahl, Jonas and Ruffini, Marco},
  booktitle={NeurIPS 2025 Workshop on Scaling Environments for Agents (SEA)},
  year={2025}
}

License

This project is released under the Apache 2.0 License. See LICENSE for details.
