Hi, I failed to reproduce the Llama2-7b-4k (w/o SFT) results reported in the paper.
Here are our results:
| Methods | Tokens | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction | Avg |
|---|---|---|---|---|---|---|---|---|
| (L-Eval) Llama2-7b-4k (w/o SFT) | 4k | 20.05 | 2.0 | 28.71 | 24.53 | 0.00 | 40.62 | 19.31 |
| (Ours) Llama2-7b-4k (w/o SFT) | 4k | 15.26 | 19.0 | 30.69 | 13.01 | 3.33 | 35.93 | 19.54 |
Here is our experimental setting:
We modified the llama2-chat-test.py file, disabled the NTK parameters, and used Llama2-7b to conduct the evaluation.
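For clarity, the change we made is roughly the following (a minimal sketch in plain transformers, not the actual diff from llama2-chat-test.py; the model path and variable names here are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical sketch of our edit (the actual variable names in
# llama2-chat-test.py may differ): load the base Llama2-7b weights
# and do not pass any NTK/RoPE-scaling config, so the model keeps
# its native 4k context window.
model_path = "meta-llama/Llama-2-7b-hf"  # base model, w/o SFT

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    rope_scaling=None,   # NTK scaling disabled
    torch_dtype="auto",
    device_map="auto",
)
```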
And we ran it like this:

```bash
python3 Baselines/llama2-chat-test.py \
    --scale 7b \
    --max_length 4k \
    --metric exam_eval
```
What is the possible reason for this discrepancy? Should I adjust the prompt or other parameters?