
Conversation

@amyxlu amyxlu commented Jul 9, 2025

Description

Adds scripts for token selection by scoring per-token loss under a pretrained autoregressive model (ProtGPT2); a rough sketch of the scoring step follows the list below:

  • Distributed dataloading by pre-specifying the NumPy offsets in the FASTA loader
  • Load the ProtGPT2 model as an option for all Lobster users
  • Launch SLURM jobs for large-scale inference and save per-token data to Parquet files
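
A minimal sketch of the per-token scoring step, for orientation only; this is not the PR's implementation, and the Hugging Face model id `nferruz/ProtGPT2` and the helper name are assumptions:

```python
# Hedged sketch (not the PR's code): per-token negative log-likelihood under ProtGPT2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_nll(sequence: str, model, tokenizer) -> torch.Tensor:
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.inference_mode():
        logits = model(**inputs).logits              # (1, seq_len, vocab)
    targets = inputs["input_ids"][:, 1:]             # each position predicts the next token
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll.squeeze(0)                            # one loss value per predicted token

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()
print(per_token_nll("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", model, tokenizer))
```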

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • Tests pass locally
  • Added new tests for new functionality
  • Updated existing tests if needed

Checklist

  • Code follows style guidelines
  • Self-review completed
  • Documentation updated if needed
  • No breaking changes (or clearly documented)

@amyxlu amyxlu requested review from karinazad, kleinhenz and ncfrey July 9, 2025 22:24
torch.set_float32_matmul_precision("high")


def get_args():

Can we use Hydra for this?
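
For reference, a minimal sketch of the Hydra-based entry point the comment suggests; the config path, config name, and fields here are hypothetical, not part of this PR:

```python
# Hedged sketch (not the PR's code): a Hydra config in place of get_args()/argparse.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="token_selection", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg fields are placeholders; they would mirror the current argparse options.
    print(cfg.fasta_path, cfg.max_num_per_shard)

if __name__ == "__main__":
    main()
```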

@karinazad

Looks great! Could you please add a usage example to examples?

@karinazad

token_selection should ideally live inside src/lobster or src/lobster/model/utils, with the script in cmdline. It would also help if the script didn't implement any actual logic itself but only called functions defined in token_selection.py (a sketch of that layout follows).
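
A hedged sketch of what that layout could look like; the module paths, import path, and function name are hypothetical, not the PR's code:

```python
# src/lobster/cmdline/token_selection.py (hypothetical thin entry point):
# all real logic would live in a token_selection module under src/lobster.
from lobster.token_selection import score_sequences  # hypothetical import path

def main() -> None:
    # Parse/validate arguments here, then delegate; the script itself holds no logic.
    score_sequences(fasta_path="data/sequences.fasta", output_dir="outputs/")

if __name__ == "__main__":
    main()
```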

cur_num_in_shard += len(outputs)

if cur_num_in_shard >= args.max_num_per_shard:
print(f"Saving shard {cur_shard_num} to {output_file}...")

output_file seems to be defined only in the next line - does this work?
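
One possible shape for the save step being flagged, with the path built before it is logged; the helper name and file-naming scheme are assumptions, not the PR's code:

```python
# Hedged sketch: define output_file before referencing it in the log message.
from pathlib import Path
import pandas as pd

def save_shard(rows: list[dict], shard_num: int, output_dir: Path) -> Path:
    output_file = output_dir / f"shard_{shard_num:05d}.parquet"
    print(f"Saving shard {shard_num} to {output_file}...")
    pd.DataFrame(rows).to_parquet(output_file)
    return output_file
```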


cur_shard_num += 1
cur_num_in_shard = 0
results_tmp_list = []

I think leftover results are not saved? If results_tmp_list still has data when the loop ends, it might be lost. (A possible flush is sketched below.)
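
A hedged sketch of that flush, reusing the hypothetical save_shard helper sketched above; the variable names come from the snippet, everything else is assumed:

```python
# After the main loop: write any partial shard so trailing results are not dropped.
if results_tmp_list:
    save_shard(results_tmp_list, cur_shard_num, output_dir)  # hypothetical helper, see above
```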

"""Load sequence metadata from assigned shards."""
sequences = []

for shard_file in self.shard_files:

Why not just load all shards at once with pd.read_parquet?
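
A hedged sketch of the one-shot load the comment suggests; the placeholder paths stand in for self.shard_files from the snippet above:

```python
# Hedged sketch (not the PR's code): read every shard in one pass with pandas.
import pandas as pd

shard_files = ["shard_00000.parquet", "shard_00001.parquet"]  # placeholder for self.shard_files
df = pd.concat((pd.read_parquet(f) for f in shard_files), ignore_index=True)
sequences = df.to_dict("records")
```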

@karinazad

Could you please add tests?
