Add token selection scripts #141
base: main
Conversation
| torch.set_float32_matmul_precision("high") |
| def get_args(): |
can we use hydra for this?
looks great! could you pls add a usage example to
| cur_num_in_shard += len(outputs) |
| if cur_num_in_shard >= args.max_num_per_shard: |
|     print(f"Saving shard {cur_shard_num} to {output_file}...") |
output_file seems to be defined only in the next line - does this work?
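A minimal sketch of the failure mode this comment points at, assuming the log line sits inside a function and `output_file` is first assigned on the following line. The function names and the shard file naming are hypothetical, not taken from the PR.

```python
def log_then_assign(shard_num: int) -> str:
    # Hypothetical stand-in for the reviewed lines: the f-string reads
    # `output_file` one line before the name is bound in this scope.
    print(f"Saving shard {shard_num} to {output_file}...")
    output_file = f"shard_{shard_num:05d}.parquet"
    return output_file


def assign_then_log(shard_num: int) -> str:
    # Same two statements with the order swapped, so the name exists
    # by the time the message is formatted.
    output_file = f"shard_{shard_num:05d}.parquet"
    print(f"Saving shard {shard_num} to {output_file}...")
    return output_file


try:
    log_then_assign(0)
except UnboundLocalError as err:
    print(f"log_then_assign failed: {err}")

assign_then_log(0)
```

If the name were instead bound on an earlier loop iteration, the line would not raise but would log the previous shard's path, which is also misleading.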
| cur_shard_num += 1 |
| cur_num_in_shard = 0 |
| results_tmp_list = [] |
I think leftover results are not saved? if results_tmp_list has data when loop ends, it might be lost
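One way to address this is a final flush after the loop. The sketch below reuses the variable names visible in the diff (`results_tmp_list`, `cur_shard_num`, `max_num_per_shard`). The function `write_in_shards`, the `save_fn` callback, and the file naming are assumptions made for illustration, not the PR's code.

```python
from typing import Callable, Iterable, List


def write_in_shards(
    batches: Iterable[List[dict]],
    max_num_per_shard: int,
    save_fn: Callable[[List[dict], str], None],
) -> int:
    """Group batch outputs into shards; return the number of shards saved."""
    results_tmp_list: List[dict] = []
    cur_shard_num = 0

    def flush() -> None:
        nonlocal cur_shard_num, results_tmp_list
        # Build the path first so the log message can refer to it.
        output_file = f"shard_{cur_shard_num:05d}.parquet"  # hypothetical naming
        print(f"Saving shard {cur_shard_num} to {output_file}...")
        save_fn(results_tmp_list, output_file)
        cur_shard_num += 1
        results_tmp_list = []

    for outputs in batches:
        results_tmp_list.extend(outputs)
        if len(results_tmp_list) >= max_num_per_shard:
            flush()

    # Final flush: without this, a trailing partial shard is silently dropped.
    if results_tmp_list:
        flush()
    return cur_shard_num
```

Routing both the in-loop save and the final save through one `flush` helper keeps the shard counter and the buffer reset in a single place.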
| """Load sequence metadata from assigned shards.""" |
| sequences = [] |
| for shard_file in self.shard_files: |
why not just load all shards at once with pd.read_parquet?
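A rough sketch of what this suggestion could look like: read each assigned shard with `pd.read_parquet` and concatenate the frames once. The helper name `load_shards` and its `reader` parameter are hypothetical. The parameter is there only so the function can be exercised without a parquet engine installed.

```python
from typing import Callable, Iterable

import pandas as pd


def load_shards(
    shard_files: Iterable[str],
    reader: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Read every shard and return a single DataFrame with a fresh index."""
    frames = [reader(path) for path in shard_files]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Whether this is preferable to the per-shard loop depends on how much metadata each worker holds in memory at once, which the excerpt does not show.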
could you pls add tests?
Description
Scripts for token selection by assessing per-token loss under a pretrained autoregressive model (ProtGPT2):
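For readers unfamiliar with the term, per-token loss under an autoregressive model is the negative log-probability the model assigns to each token given the tokens before it. The dependency-free sketch below illustrates the quantity only. It is not the PR's implementation, which presumably runs ProtGPT2 in PyTorch, and the function name and plain-list inputs are assumptions.

```python
import math
from typing import List, Sequence


def per_token_loss(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Return -log p(token_t | tokens before t) for t = 1 .. T-1.

    logits[t] holds the model's unnormalized scores for the token at
    position t + 1, so the first token has no loss.
    """
    losses = []
    for t in range(1, len(token_ids)):
        row = logits[t - 1]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[token_ids[t]])
    return losses


# With uniform scores over 4 tokens, every loss equals log(4).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(per_token_loss(uniform, [0, 1, 2]))
```

A selection script could then keep or drop tokens by thresholding these values. The criterion the PR actually uses is not stated in this excerpt.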
Type of Change
Testing
Checklist