Fail to Reproduce WebArena Results with GenericAgent-GPT-4o

Dear AgentLab Authors, 

Thank you for the great work! I'm trying to reproduce the WebArena results for GenericAgent-GPT-4o using the code below, which follows AgentLab's defaults. However, the score I got is 25, which is significantly lower than the 31.4 shown on the [BrowserGym Leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard). Do you have any suggestions for reproducing it? Is there any code available that reproduces the ~31 result?

Thanks again for your great contribution to the community!

```
# Default GPT-4o configuration for GenericAgent
from agentlab.agents.generic_agent import AGENT_4o
from agentlab.experiments.study import make_study

# Build a study that runs the agent on the WebArena benchmark
study = make_study(
    benchmark="webarena",
    agent_args=[AGENT_4o],
    comment="repo 4o agent",
)

# Run the study with 5 parallel jobs
study.run(n_jobs=5)
```
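
To gauge whether a gap like 25 vs. 31.4 could be run-to-run noise, here is a rough sanity check. It is only a sketch under stated assumptions: it assumes the scores are success rates in percent over the full 812-task WebArena set, and it treats tasks as independent trials. The helper name `binomial_standard_error` is hypothetical and is not part of AgentLab.

```python
import math


def binomial_standard_error(success_rate: float, n_tasks: int) -> float:
    """Standard error of a success rate measured over n_tasks independent tasks."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


N_TASKS = 812      # assumption: size of the full WebArena task set
observed = 0.25    # success rate from the run above (assumed to be 25%)
reported = 0.314   # success rate shown on the leaderboard (assumed to be 31.4%)

se = binomial_standard_error(observed, N_TASKS)
gap = reported - observed
gap_in_se = gap / se

print(f"standard error: {se:.4f}")
print(f"gap: {gap:.3f} ({gap_in_se:.1f} standard errors)")
```

Under these assumptions the gap is several standard errors wide, so sampling noise alone seems an unlikely explanation. If the run covered fewer tasks, or some tasks ended in errors, the comparison would change.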
