Hi, I’ve been trying to replicate the HumanEval results reported in the paper for the Instruct model, but I’m only achieving around 35% accuracy.
A large fraction of responses (~40%, depending on inference settings) come out empty, consisting entirely of <end_of_text> tokens.
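For context, this is roughly how I classify a completion as empty (the exact special-token string `<end_of_text>` is taken from the decoded output; adjust if the tokenizer renders it differently):

```python
END_TOKEN = "<end_of_text>"

def is_empty_response(text: str) -> bool:
    """True if the decoded completion contains nothing but
    end-of-text tokens and whitespace."""
    return text.replace(END_TOKEN, "").strip() == ""

# A completion of pure end-of-text tokens counts as empty:
print(is_empty_response(END_TOKEN * 4))        # True
print(is_empty_response("def add(a, b): ...")) # False
```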
Here are some of the inference settings I’ve tried:
temperature=0.1, top_p=0.9, alg=entropy
temperature=0.2, top_p=0.95, alg=entropy
temperature=0.1, top_p=0.9, alg=origin
temperature=0.5, top_p=0.9, alg=origin
temperature=1.0, top_p=0.9, alg=origin
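In code, the sweep looks like this (the `diffusion_generate` call is a placeholder for whatever generation entry point the repo exposes; the keyword names just mirror the settings above and are an assumption on my part):

```python
# Sampling settings tried so far; each produced a large fraction of empty outputs.
configs = [
    {"temperature": 0.1, "top_p": 0.9,  "alg": "entropy"},
    {"temperature": 0.2, "top_p": 0.95, "alg": "entropy"},
    {"temperature": 0.1, "top_p": 0.9,  "alg": "origin"},
    {"temperature": 0.5, "top_p": 0.9,  "alg": "origin"},
    {"temperature": 1.0, "top_p": 0.9,  "alg": "origin"},
]

for cfg in configs:
    # Placeholder -- substitute the actual generation entry point, e.g.:
    # out = model.diffusion_generate(prompt_ids, max_new_tokens=512, **cfg)
    print(cfg)
```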
Could you provide guidance on:
- How to avoid the empty-response issue, and
- What inference settings were used for the reported HumanEval results?
Thanks!