This benchmark investigates the idea of using LLMs as interpreters for programming languages.
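To make the idea concrete, here is a minimal sketch of prompting a model to act as an interpreter: the model is given a program and asked to predict its output. The prompt wording and helper function below are illustrative only, not the benchmark's actual prompt format.

```python
# Sketch of the LLM-as-interpreter idea: hand the model a program and ask it
# to "execute" the code and report the final output. The wording here is a
# hypothetical example, not the prompt used in PLSemanticsBench.
def build_interpreter_prompt(program: str) -> str:
    return (
        "You are an interpreter for the following program. "
        "Execute it step by step and report only its final output.\n\n"
        f"```\n{program}\n```"
    )

program = "x = 3\ny = x * 4\nprint(y)"
prompt = build_interpreter_prompt(program)
print(prompt)
```

The returned string would then be sent to a model; judging the model's answer against the program's true output is what such a benchmark measures.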
- Conda is used for managing dependencies and creating a virtual environment to run the experiments.
- Navigate to the root directory containing `env.yaml` and execute the command below to install project dependencies:

  ```bash
  conda env create -f env.yaml
  ```

- In the root directory, execute the command below to activate the created virtual environment:

  ```bash
  conda activate llm-interpreter
  ```
- `data/`: directory containing the raw data in `.jsonl`
- `results/`: models' generated results will be written to this directory
- `model_configs/`: contains the config files for model inference, including hyper-parameters
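Since both the raw data and the generated results use the JSON Lines format (one JSON object per line), a record file can be read with a few lines of standard-library Python. The file name and record fields below are hypothetical examples, not the benchmark's actual schema.

```python
import json
from pathlib import Path

# Hypothetical example record; the real field names in data/ and results/
# may differ.
record = {"task": "output-prediction", "model": "gpt-4o", "prediction": "12"}

# Write the record in JSON Lines format: one JSON object per line.
path = Path("results_example.jsonl")
with path.open("w") as f:
    f.write(json.dumps(record) + "\n")

# Read every record back by parsing each line independently.
with path.open() as f:
    records = [json.loads(line) for line in f]
print(records[0]["model"])
```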
We provide the code to run gpt-4o models on all four tasks in PLSemanticsBench.
The results will be written to the `results/` directory.
```bash
# the default model is gpt-4o
python main.py
```

Please use the following citation if you found our work useful.
```bibtex
@inproceedings{plsemanticsbench,
title={PLSemanticsBench: Large Language Models are Bad Programming Language Interpreters},
author={Jiyang Zhang and Aditya Thimmaiah and Samuel Yuan and Junyi Jessy Li and Milos Gligoric},
booktitle={},
year={2025},
url={}
}
```