
Misalignment in ProLong-64k Reproduction with HELMET Testing: Re-Rank Outputs Content Instead of IDs, and Performance Issues in Data (4B, 8B) Models #19

@mckvg

Description


While reproducing ProLong-64k from Llama-3-8B-Instruct, using the official 64k continued pre-training data and the same SFT stage on UltraChat (~1B tokens), I noticed clear misalignment behaviors when evaluating with HELMET. So far, I have reproduced the ProLong-64k-4B and ProLong-64k-8B models, applied chat SFT, and evaluated both with HELMET. Their re-rank scores are around 15. Looking at individual examples, many responses do not output the unique IDs at all and instead list content related to the query (here, dish names), for example:

ProLong-64k-Data(04B)-Instruct ❌
🔍 Query:
What is the most popular food in Switzerland?
📊 Expected Ranking:
8344031 > 6230222 > 2206194 > 3437288 > 3437281 > 6230227 > 6230221 > 6230226 > 5167510 > 3437284 >
🤖 Model Output:
Ranking: Fondue > Rösti > Geschnetzeltes > Cheese-Rosti > Cheese Fondue > Raclette > Berner Platte > Zürcher Geschnetzeltes > Zürcher Eintopf > Spätzle > Älplermagronen > Appenzeller > Gruyere > Emmental > Vacherin > Appenzeller cheese > Sbrinz cheese > Tête de Moine cheese > Blue cheese > Raclette
Metrics (NDCG@10):
NDCG@10: 0
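To show why this kind of output scores exactly 0, here is a minimal sketch of an ID-based re-rank scorer. It assumes the scorer extracts numeric IDs from the text after "Ranking:" and computes NDCG@10 with linear gain against graded relevance labels. This is not HELMET's actual scoring code: `parse_ranked_ids`, `ndcg_at_k`, and the `labels` dict are hypothetical and for illustration only.

```python
import math
import re

def parse_ranked_ids(output: str) -> list[str]:
    """Return numeric IDs from a model's ranking output, in order, without duplicates.

    Assumption (hypothetical scorer): IDs are runs of digits, and the ranking
    follows a "Ranking:" marker.
    """
    _, marker, tail = output.partition("Ranking:")
    text = tail if marker else output  # fall back to the whole output if no marker
    ids: list[str] = []
    for token in re.findall(r"\d+", text):
        if token not in ids:  # keep only the first occurrence of each ID
            ids.append(token)
    return ids

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain; IDs missing from `relevance` count as 0."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical graded labels for three of the expected IDs (not the real qrels).
labels = {"8344031": 3, "6230222": 2, "2206194": 1}

content_output = "Ranking: Fondue > Rösti > Geschnetzeltes > Raclette"
id_output = "Ranking: 8344031 > 6230222 > 2206194"

print(ndcg_at_k(parse_ranked_ids(content_output), labels))  # 0.0: no IDs parsed
print(ndcg_at_k(parse_ranked_ids(id_output), labels))       # 1.0: ideal order
```

Under these assumptions, an output that ranks content instead of IDs scores 0 no matter how reasonable the content ordering is, which matches the NDCG@10 of 0 in the example above.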

How can this be fixed? I tried modifying the re-rank prompt template, but it did not yield significant improvements.

Additionally, my reproduced ProLong-64k-4B model reached an average recall score of only 77 on the 64k task.

After reproducing the ProLong-64k-8B model, I observed no improvement on the ICL (in-context learning) and re-rank tasks. This contradicts the peak performance reported for the 8B model (the 64k, 20B setting) in the ablation table of your ProLong results.

Would you mind helping me with these questions when you're free? Thank you very much in advance!
