Description
First of all, thank you for your excellent work and for making your resources publicly available. I have been carefully going through the ProLong paper and attempting to reproduce some of the reported data statistics.
In the paper, it is mentioned that after filtering SlimPajama for documents longer than 64k tokens, the Book subset contains approximately 33B tokens and the CommonCrawl (CC) subset about 15B tokens.
In my own reproduction, however, I obtained noticeably smaller numbers (counted roughly as in the sketch below):
Book: ~22B tokens
CC: ~10B tokens
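
For reference, this is approximately how I computed the counts above. It is a minimal sketch assuming the cerebras/SlimPajama-627B release on the Hugging Face Hub, the Llama-2 tokenizer, and a cutoff of 64 * 1024 tokens; all three are my assumptions, so please correct me if ProLong used a different tokenizer, release, or threshold.

```python
from collections import defaultdict

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: Llama-2 tokenizer and the public SlimPajama-627B release.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

MIN_LEN = 64 * 1024  # 64k-token threshold (exact cutoff assumed)
token_counts = defaultdict(int)

for example in dataset:
    # Each example carries its RedPajama source, e.g. "RedPajamaBook"
    # or "RedPajamaCommonCrawl", in the meta field.
    source = example["meta"]["redpajama_set_name"]
    n_tokens = len(tokenizer(example["text"]).input_ids)
    if n_tokens >= MIN_LEN:
        token_counts[source] += n_tokens

for source, total in sorted(token_counts.items()):
    print(f"{source}: {total / 1e9:.1f}B tokens")
```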
This is roughly one-third fewer tokens than reported. Moreover, according to the SlimPajama-627B documentation, the Book portion constitutes only about 4.2% of the entire dataset, i.e. at most 0.042 × 627B ≈ 26B tokens in total, even before any length filtering. This makes the figure of 33B tokens for the >64k subset somewhat puzzling.
I would greatly appreciate it if you could clarify:
Were there additional filtering or preprocessing steps applied in ProLong that I may have overlooked?
Are the reported numbers approximate, or perhaps derived from a different release/version of SlimPajama?
Thank you very much for your time and for any clarification you can provide.