Clarification on SlimPajama (>64k) token counts reported in ProLong #15

@HaoranDeng

First of all, thank you for your excellent work and for making your resources publicly available. I have been carefully going through the ProLong paper and attempting to reproduce some of the reported data statistics.

In the paper, it is mentioned that after filtering SlimPajama for sequences longer than 64k, the Book subset contains approximately 33B tokens, and the CommonCrawl (CC) subset contains about 15B tokens.

In my own reproduction, however, I obtained noticeably smaller numbers:

Book: ~22B tokens

CC: ~10B tokens

Both figures are roughly one-third lower than those reported. Moreover, according to the SlimPajama-627B documentation, the Book portion makes up only about 4.2% of the entire dataset, which corresponds to roughly 26B tokens in total (0.042 × 627B ≈ 26.3B). A >64k subset of 33B tokens would therefore be larger than the entire Book portion, which is puzzling.
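To make the arithmetic explicit, here is a small sanity check using only the figures quoted in this issue. The variable names are my own, and the 627B total and 4.2% share are taken at face value from the SlimPajama-627B documentation.

```python
# Figures quoted in this issue, in billions of tokens.
reported = {"Book": 33.0, "CC": 15.0}    # ProLong paper, >64k subset
reproduced = {"Book": 22.0, "CC": 10.0}  # my reproduction (approximate)

# Relative shortfall of the reproduction versus the paper.
shortfall = {name: 1 - reproduced[name] / reported[name] for name in reported}

# Size of the whole Book portion implied by the SlimPajama-627B documentation.
total_tokens_b = 627.0  # total dataset size, in billions of tokens
book_share = 0.042      # documented Book share (about 4.2%)
book_total_b = total_tokens_b * book_share

for name, frac in shortfall.items():
    print(f"{name}: {frac:.1%} fewer tokens than reported")
print(f"Whole Book portion: ~{book_total_b:.1f}B tokens")
print(f"Reported >64k Book subset exceeds it: {reported['Book'] > book_total_b}")
```

Both subsets come out about one-third short, and the reported 33B figure is larger than the whole Book portion as documented.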

I would greatly appreciate it if you could clarify:

Were there additional filtering or preprocessing steps applied in ProLong that I may have overlooked?

Or are the reported numbers approximate, or perhaps derived from a different release/version of SlimPajama?
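For reference, the kind of count I mean can be sketched as below. This is only an illustration under stated assumptions, not ProLong's actual pipeline: `count_long_doc_tokens` is a hypothetical helper, the whitespace tokenizer is a stand-in for a real one, and treating 64k as 65,536 rather than 64,000 is my own guess.

```python
from typing import Callable, Iterable, List


def count_long_doc_tokens(
    docs: Iterable[str],
    tokenize: Callable[[str], List[str]],
    min_tokens: int = 64 * 1024,  # assumption: "64k" means 65,536 tokens
) -> int:
    """Sum the token counts of documents strictly longer than min_tokens."""
    total = 0
    for text in docs:
        n = len(tokenize(text))
        if n > min_tokens:
            total += n
    return total


# Toy usage with a whitespace tokenizer and a tiny threshold.
toy_docs = ["a b c d e", "a b", "a b c d e f g"]
toy_total = count_long_doc_tokens(toy_docs, str.split, min_tokens=4)
print(toy_total)  # 12: the 5-token and 7-token documents pass the threshold
```

The result depends on which tokenizer is passed in and on how the threshold is defined, so knowing which choices were used for the paper's numbers would help me compare.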

Thank you very much for your time and for any clarification you can provide.
