Hi, I’ve been trying to replicate the HumanEval results reported in the paper for the Instruct model, but I’m only achieving around 35% accuracy.
A large fraction of responses (~40%, depending on inference settings) come out empty, consisting entirely of <end_of_text> tokens.
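For context, this is roughly how I classify a completion as empty (the exact special-token string `<end_of_text>` is taken from the decoded output; adjust if the tokenizer renders it differently):

```python
END_TOKEN = "<end_of_text>"

def is_empty_response(text: str) -> bool:
    """True if the decoded completion contains nothing but
    end-of-text tokens and whitespace."""
    return text.replace(END_TOKEN, "").strip() == ""

# A completion of pure end-of-text tokens counts as empty:
print(is_empty_response(END_TOKEN * 4))        # True
print(is_empty_response("def add(a, b): ...")) # False
```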
Here are some of the inference settings I’ve tried:
temperature=0.1, top_p=0.9, alg=entropy
temperature=0.2, top_p=0.95, alg=entropy
temperature=0.1, top_p=0.9, alg=origin
temperature=0.5, top_p=0.9, alg=origin
temperature=1.0, top_p=0.9, alg=origin
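In code, the sweep looks like this (the `diffusion_generate` call is a placeholder for whatever generation entry point the repo exposes; the keyword names just mirror the settings above and are an assumption on my part):

```python
# Sampling settings tried so far; each produced a large fraction of empty outputs.
configs = [
    {"temperature": 0.1, "top_p": 0.9,  "alg": "entropy"},
    {"temperature": 0.2, "top_p": 0.95, "alg": "entropy"},
    {"temperature": 0.1, "top_p": 0.9,  "alg": "origin"},
    {"temperature": 0.5, "top_p": 0.9,  "alg": "origin"},
    {"temperature": 1.0, "top_p": 0.9,  "alg": "origin"},
]

for cfg in configs:
    # Placeholder -- substitute the actual generation entry point, e.g.:
    # out = model.diffusion_generate(prompt_ids, max_new_tokens=512, **cfg)
    print(cfg)
```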
Could you provide guidance on:
- How to avoid the empty-response issue, and
- What inference settings were used for the reported HumanEval results?
Thanks!