
Conversation

@amyxlu amyxlu commented Jul 9, 2025

Description

Adds scripts for token selection by scoring per-token loss under a pretrained autoregressive model (ProtGPT2); a rough sketch of the scoring step follows the list below:

  • Distributed dataloading by pre-specifying the NumPy offsets in the FASTA loader
  • Load the ProtGPT2 model as an option for all Lobster users
  • Launch SLURM jobs for large-scale inference and save per-token data to Parquet files
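
A minimal sketch of the per-token scoring step, for orientation only; this is not the PR's implementation, and the Hugging Face model id `nferruz/ProtGPT2` and the helper name are assumptions:

```python
# Hedged sketch (not the PR's code): per-token negative log-likelihood under ProtGPT2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_nll(sequence: str, model, tokenizer) -> torch.Tensor:
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.inference_mode():
        logits = model(**inputs).logits              # (1, seq_len, vocab)
    targets = inputs["input_ids"][:, 1:]             # each position predicts the next token
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll.squeeze(0)                            # one loss value per predicted token

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()
print(per_token_nll("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", model, tokenizer))
```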

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • Tests pass locally
  • Added new tests for new functionality
  • Updated existing tests if needed

Checklist

  • Code follows style guidelines
  • Self-review completed
  • Documentation updated if needed
  • No breaking changes (or clearly documented)

@amyxlu amyxlu requested review from karinazad, kleinhenz and ncfrey July 9, 2025 22:24
torch.set_float32_matmul_precision("high")


def get_args():

Can we use Hydra for this?
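
For reference, a minimal sketch of the Hydra-based entry point the comment suggests; the config path, config name, and fields here are hypothetical, not part of this PR:

```python
# Hedged sketch (not the PR's code): a Hydra config in place of get_args()/argparse.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="token_selection", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg fields are placeholders; they would mirror the current argparse options.
    print(cfg.fasta_path, cfg.max_num_per_shard)

if __name__ == "__main__":
    main()
```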

@karinazad

Looks great! Could you please add a usage example to examples?

@karinazad

token_selection should ideally live inside src/lobster or src/lobster/model/utils, with the script in cmdline. It would also help if the script didn't implement any actual logic itself but only called functions defined in token_selection.py (a sketch of that layout follows).
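
A hedged sketch of what that layout could look like; the module paths, import path, and function name are hypothetical, not the PR's code:

```python
# src/lobster/cmdline/token_selection.py (hypothetical thin entry point):
# all real logic would live in a token_selection module under src/lobster.
from lobster.token_selection import score_sequences  # hypothetical import path

def main() -> None:
    # Parse/validate arguments here, then delegate; the script itself holds no logic.
    score_sequences(fasta_path="data/sequences.fasta", output_dir="outputs/")

if __name__ == "__main__":
    main()
```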

cur_num_in_shard += len(outputs)

if cur_num_in_shard >= args.max_num_per_shard:
print(f"Saving shard {cur_shard_num} to {output_file}...")

output_file seems to be defined only in the next line - does this work?
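
One possible shape for the save step being flagged, with the path built before it is logged; the helper name and file-naming scheme are assumptions, not the PR's code:

```python
# Hedged sketch: define output_file before referencing it in the log message.
from pathlib import Path
import pandas as pd

def save_shard(rows: list[dict], shard_num: int, output_dir: Path) -> Path:
    output_file = output_dir / f"shard_{shard_num:05d}.parquet"
    print(f"Saving shard {shard_num} to {output_file}...")
    pd.DataFrame(rows).to_parquet(output_file)
    return output_file
```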


cur_shard_num += 1
cur_num_in_shard = 0
results_tmp_list = []

I think leftover results are not saved? If results_tmp_list still has data when the loop ends, it might be lost. (A possible flush is sketched below.)
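
A hedged sketch of that flush, reusing the hypothetical save_shard helper sketched above; the variable names come from the snippet, everything else is assumed:

```python
# After the main loop: write any partial shard so trailing results are not dropped.
if results_tmp_list:
    save_shard(results_tmp_list, cur_shard_num, output_dir)  # hypothetical helper, see above
```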

"""Load sequence metadata from assigned shards."""
sequences = []

for shard_file in self.shard_files:

Why not just load all shards at once with pd.read_parquet?
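
A hedged sketch of the one-shot load the comment suggests; the placeholder paths stand in for self.shard_files from the snippet above:

```python
# Hedged sketch (not the PR's code): read every shard in one pass with pandas.
import pandas as pd

shard_files = ["shard_00000.parquet", "shard_00001.parquet"]  # placeholder for self.shard_files
df = pd.concat((pd.read_parquet(f) for f in shard_files), ignore_index=True)
sequences = df.to_dict("records")
```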

@karinazad

Could you please add tests?
