Description
First of all, thank you for your excellent work and for making your resources publicly available. I have been carefully going through the ProLong paper and attempting to reproduce some of the reported data statistics.
In the paper, it is mentioned that after filtering SlimPajama for documents longer than 64k tokens, the Book subset contains approximately 33B tokens and the CommonCrawl (CC) subset about 15B tokens.
In my own reproduction, however, I obtained noticeably smaller numbers (counted roughly as in the sketch below):
Book: ~22B tokens
CC: ~10B tokens
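
For reference, this is approximately how I computed the counts above. It is a minimal sketch assuming the cerebras/SlimPajama-627B release on the Hugging Face Hub, the Llama-2 tokenizer, and a cutoff of 64 * 1024 tokens; all three are my assumptions, so please correct me if ProLong used a different tokenizer, release, or threshold.

```python
from collections import defaultdict

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: Llama-2 tokenizer and the public SlimPajama-627B release.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

MIN_LEN = 64 * 1024  # 64k-token threshold (exact cutoff assumed)
token_counts = defaultdict(int)

for example in dataset:
    # Each example carries its RedPajama source, e.g. "RedPajamaBook"
    # or "RedPajamaCommonCrawl", in the meta field.
    source = example["meta"]["redpajama_set_name"]
    n_tokens = len(tokenizer(example["text"]).input_ids)
    if n_tokens >= MIN_LEN:
        token_counts[source] += n_tokens

for source, total in sorted(token_counts.items()):
    print(f"{source}: {total / 1e9:.1f}B tokens")
```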
This is roughly one-third fewer tokens than reported. Moreover, according to the SlimPajama-627B documentation, the Book portion constitutes only about 4.2% of the entire dataset, i.e. at most 0.042 × 627B ≈ 26B tokens in total, even before any length filtering. This makes the figure of 33B tokens for the >64k subset somewhat puzzling.
I would greatly appreciate it if you could clarify:
Were there additional filtering or preprocessing steps applied in ProLong that I may have overlooked?
Are the reported numbers approximate, or perhaps derived from a different release/version of SlimPajama?
Thank you very much for your time and for any clarification you can provide.