EnConda-Bench: Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
A comprehensive benchmark framework for evaluating AI agents' performance on Python environment configuration tasks.
EnConda-Bench is an end-to-end environment configuration benchmark system specifically designed to evaluate the capabilities of large language models and AI agents in identifying, analyzing, and fixing Python environment configuration errors. The system provides a complete dataset, inference tools, and evaluation framework.
```
EnConda-Bench/
├── Benchmark_Data/                          # Benchmark dataset
│   ├── error_types.json                     # Error type definitions
│   ├── Enconda_benchmark_data.jsonl         # Main benchmark data
│   └── final_output_benchmark_data_final/   # Processed dataset
├── Inference/                               # Inference system
│   ├── core/                                # Core inference modules
│   ├── configs/                             # Configuration files
│   ├── scripts/                             # Run scripts
│   └── docs/                                # Documentation
├── Evaluation/                              # Evaluation system
│   ├── Evaluate/                            # Automated metric evaluation
│   └── Execution/                           # End-to-end execution testing
└── Dockerfiles/                             # Docker configuration files
```
- Error Type Definitions: Six environment configuration error types (E1, E2, E4, E6, E7, E8)
- Real-world Data: Based on README files from real GitHub repositories
- Standardized Format: Structured JSON/JSONL data (see the loading sketch after the table below)
- Original Data Source: Raw repositories available as a HuggingFace Dataset
| Type | Name | Description |
|---|---|---|
| E1 | Dependency Installation Error | Missing dependencies, version errors, etc. |
| E2 | Command Usage or Syntax Error | Incorrect commands, parameters, or syntax |
| E4 | File Path or Missing File Error | Path errors or non-existent files |
| E6 | Logical Order Error | Incorrect installation step sequence |
| E7 | Version Compatibility Error | Version conflicts or incompatibilities |
| E8 | Other Miscellaneous Errors | Formatting issues, unclear descriptions, etc. |
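The benchmark data is distributed as JSONL, one record per line. As a quick sanity check, you can load it and inspect the keys. This is a minimal sketch that only assumes the file path shown in the project structure, not any particular record schema:

```python
import json

# Load the benchmark records (path taken from the project layout above).
# No schema is assumed here: we only print the top-level keys of one record
# so you can inspect the actual fields yourself.
with open("Benchmark_Data/Enconda_benchmark_data.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} records")
print(sorted(records[0].keys()))
```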
Agent inference system supporting two modes:
- LLM Mode: Direct analysis by a large language model
- Agent Mode: Interactive analysis driven by an agent
- 🔍 Automatic error detection and classification
- 🛠️ Fix script generation
- 📊 Batch processing and analysis
- 📈 Detailed statistical reports
Dual evaluation framework:
- Error Type Accuracy: Precision, recall, F1-score (see the sketch after this list)
- Text Similarity: LLM-based semantic similarity evaluation
- Intelligent Matching: Handles inconsistencies between predictions and ground truth
- Docker Containers: Isolated execution environment
- Script Validation: Actual execution of generated configuration scripts
- Success Rate Statistics: Environment configuration success rate analysis
- Error Diagnosis: Detailed execution logs and error analysis
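As a rough illustration of the automated metrics, per-type precision/recall/F1 over paired ground-truth and predicted error codes can be computed as below. This is a minimal sketch, not the benchmark's actual evaluation code, which additionally uses LLM-based matching to align predictions with ground truth:

```python
from collections import Counter

ERROR_TYPES = ("E1", "E2", "E4", "E6", "E7", "E8")

def per_type_prf(gold, pred, types=ERROR_TYPES):
    """Per-error-type precision/recall/F1 over paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted type p incorrectly
            fn[g] += 1  # missed the true type g
    scores = {}
    for t in types:
        prec = tp[t] / (tp[t] + fp[t]) if (tp[t] + fp[t]) else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if (tp[t] + fn[t]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        scores[t] = {"precision": prec, "recall": rec, "f1": f1}
    return scores

print(per_type_prf(["E1", "E4", "E6"], ["E1", "E2", "E6"]))
```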
- Python 3.10+
- Docker Engine
- 8GB+ RAM (recommended)
- OpenAI API key (for LLM evaluation)
```bash
# Clone the project
git clone <repository-url>
cd EnConda-Bench

# Install Python dependencies
pip install -r requirements.txt
# Or use uv (recommended)
uv sync
```

```bash
# Pull pre-built images
docker pull ghcr.io/research-org/envbench-python:latest

# Or build local images
docker build -f Dockerfiles/python.Dockerfile -t envbench-python .
```
- Create environment variables file:
```bash
cp .env.example .env
# Edit .env file and add your OpenAI API key
```
- Configure inference parameters (`Inference/configs/llm_config.yaml`):
```yaml
openai:
  api_key: "your-openai-api-key"
  model_name: "gpt-4"
  base_url: "https://api.openai.com/v1"
```

```bash
cd Inference

# LLM mode
python run.py --mode llm --config configs/llm_config.yaml

# Agent mode
python run.py --mode agent --config configs/agent_config.yaml
```

```bash
cd Evaluation/Evaluate

python run_evaluation.py \
    --results_dir /path/to/inference/results \
    --data_root_dir /path/to/Benchmark_Data \
    --output_dir evaluation_output
```
```bash
cd Evaluation/Execution

# Convert data format
python convert_to_jsonl.py \
    --input_file inference_results.jsonl \
    --output_file execution_input.jsonl

# Run execution tests
./uv_run.sh
```
After evaluation completes, results are saved to the specified output directory:
- `detailed_evaluation_results.json`: Detailed evaluation results
- `evaluation_summary.json`: Summary statistics
- `results.jsonl`: Execution test results
- Error Type Recognition: Precision, recall, F1-score
- Description Accuracy: Semantic similarity of error descriptions
- Fix Solution Quality: Semantic similarity of fix solutions
- Environment Configuration Success Rate: Proportion of successful script executions (see the sketch after this list)
- Clean Pass Rate: Proportion of environment configurations with no issues
- Error Diagnosis: Specific failure cause analysis
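For illustration, the execution metrics above could be summarized from `results.jsonl` roughly as follows. The field names `exit_code` and `issues` are assumptions made for this sketch; check the actual file for the real keys:

```python
import json

def summarize(path="evaluation_output/results.jsonl"):
    # NOTE: "exit_code" and "issues" are assumed field names, for illustration only.
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    total = len(rows) or 1
    success = sum(1 for r in rows if r.get("exit_code") == 0)
    clean = sum(1 for r in rows if r.get("exit_code") == 0 and not r.get("issues"))
    return {
        "total": len(rows),
        "success_rate": success / total,    # scripts that executed successfully
        "clean_pass_rate": clean / total,   # successful runs with no reported issues
    }

print(summarize())
```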
- Update `Benchmark_Data/error_types.json`
- Add corresponding detection logic in the inference system
- Update evaluation metric calculations
- Create a new Dockerfile (`Dockerfiles/new_language.Dockerfile`)
- Add language-specific configuration files
- Extend inference and evaluation logic
- Add a new client in `Inference/core/clients/` (see the sketch below)
- Update the configuration file format
- Test compatibility
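A new client might look roughly like the stub below. This is a hypothetical sketch: the real base-class interface in `Inference/core/clients/` is not reproduced here, so adapt class and method names to match the existing clients:

```python
from dataclasses import dataclass

@dataclass
class MyProviderClient:
    """Hypothetical client stub; align with the existing clients' interface."""
    api_key: str
    model_name: str = "my-model"

    def chat(self, prompt: str, system_prompt: str = "") -> str:
        """Send one prompt to the provider and return the raw text response."""
        # Replace this stub with the provider's real SDK or HTTP call.
        raise NotImplementedError("wire up the provider API here")
```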
- Use batch processing to reduce API calls
- Enable result caching
- Parallel processing of multiple files (see the sketch after these tips)
- Adjust Docker resource limits
- Use SSD storage for improved I/O performance
- Set reasonable concurrent worker counts
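The batching, caching, and parallelism tips can be combined roughly as in the sketch below; the `call_llm` helper is a hypothetical placeholder for whichever client you configured above:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call via your configured client."""
    return f"analysis of {len(prompt)} characters"

# Cache repeated prompts so identical READMEs are analyzed only once.
@lru_cache(maxsize=1024)
def analyze_cached(readme_text: str) -> str:
    return call_llm(readme_text)

# Fan independent analyses out over a bounded worker pool.
def analyze_many(readme_texts, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_cached, readme_texts))

print(analyze_many(["readme one", "readme two", "readme one"]))
```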
This project is licensed under an open source license. See the LICENSE file for details.
- Thanks to the EnvBench project for providing the foundational framework
- Thanks to all contributors for their efforts
- Thanks to the open source community for their support
If you find this work useful, please give us a ⭐️ and consider citing:
```bibtex
@article{EnConda_Bench,
  title={Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents},
  author={Kuang, Jiayi and Li, Yinghui and Zhang, Xin and Li, Yangning and Yin, Di and Sun, Xing and Shen, Ying and Yu, Philip S},
  journal={arXiv preprint arXiv:2510.25694},
  url={https://arxiv.org/abs/2510.25694},
  year={2025}
}
```