Adding post instruction tokens for steering #103

SevKod · 2025-11-06T15:23:53Z

Introduces the ability to capture activations from, and apply steering to, the post-instruction tokens (as described in the appendix of https://arxiv.org/pdf/2406.11717).
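
As a rough illustration of what "post-instruction tokens" refers to: in a chat-formatted prompt, these are the template tokens that follow the user's instruction (for example an end-of-turn marker and the assistant header). The sketch below is not sdialog's implementation. `post_instruction_positions`, `steer`, the toy template, and the whitespace tokenizer are all hypothetical stand-ins; a real setup would use the model's own tokenizer and hook the model's activations.

```python
def post_instruction_positions(template, instruction, tokenize=str.split):
    """Return the indices of prompt tokens that come after the instruction.

    Assumes `template` contains exactly one "{instruction}" placeholder and
    that token counts are additive across prefix, instruction, and suffix
    (true for this toy whitespace tokenizer, not for subword tokenizers).
    """
    prefix, suffix = template.split("{instruction}")
    start = len(tokenize(prefix)) + len(tokenize(instruction))
    return list(range(start, start + len(tokenize(suffix))))


def steer(activations, positions, direction, alpha=1.0):
    """Return a copy of `activations` with alpha * direction added at `positions`."""
    steered = [list(vec) for vec in activations]
    for pos in positions:
        steered[pos] = [a + alpha * d for a, d in zip(steered[pos], direction)]
    return steered


# Toy usage: 1 prefix token, 4 instruction tokens, 2 post-instruction tokens.
template = "<user> {instruction} <end> <assistant>"
positions = post_instruction_positions(template, "tell me a joke")
activations = [[0.0, 0.0] for _ in range(7)]
steered = steer(activations, positions, direction=[1.0, -1.0], alpha=2.0)
```

Only the vectors at the post-instruction positions are modified; the instruction tokens themselves are left untouched.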

codecov · 2025-11-06T17:34:09Z

Codecov Report

❌ Patch coverage is 7.96020% with 185 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.83%. Comparing base (75a2743) to head (3865a98).
⚠️ Report is 181 commits behind head on main.

Files with missing lines	Patch %	Lines
src/sdialog/interpretability/__init__.py	7.18%	168 Missing ⚠️
src/sdialog/agents.py	15.00%	17 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
+ Coverage   46.82%   53.83%   +7.01%     
==========================================
  Files          20       34      +14     
  Lines        4171     6133    +1962     
==========================================
+ Hits         1953     3302    +1349     
- Misses       2218     2831     +613

Files with missing lines	Coverage Δ
src/sdialog/agents.py	`50.28% <15.00%> (-1.79%)`	⬇️
src/sdialog/interpretability/__init__.py	`15.57% <7.18%> (-5.91%)`	⬇️

... and 16 files with indirect coverage changes


sergioburdisso · 2025-11-13T16:37:04Z

src/sdialog/interpretability/__init__.py

+            # Negative index means "from the end of generated tokens", not from the end of all tokens
+            input_response = self.agent._hooked_responses[self.response_index]['input'][0]
+            if self.token_index < 0:
+                # Convert negative index to positive relative to generated tokens


This block of three lines can be replaced simply by `activation_index = self.token_index`.
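
For context, the index semantics described in the diff comment can be sketched as follows. This assumes a layout the excerpt does not confirm: activations stored for the input tokens first, then the generated tokens. `resolve_token_index` is a hypothetical helper, not a function from sdialog.

```python
def resolve_token_index(token_index, n_input, n_generated):
    """Map an index over generated tokens to a position in input + generated.

    Negative values count back from the end of the generated tokens,
    mirroring ordinary Python indexing.
    """
    if token_index < 0:
        token_index += n_generated
    if not 0 <= token_index < n_generated:
        raise IndexError("token_index out of range for generated tokens")
    return n_input + token_index


# Toy data: 3 input-token activations followed by 4 generated-token activations.
input_acts = ["in0", "in1", "in2"]
generated_acts = ["gen0", "gen1", "gen2", "gen3"]
all_acts = input_acts + generated_acts

last = resolve_token_index(-1, len(input_acts), len(generated_acts))
first = resolve_token_index(0, len(input_acts), len(generated_acts))
```

If the indexed buffer holds only the generated-token activations, plain negative indexing (`generated_acts[-1]`) already has this meaning and no conversion is needed, which may be the simplification the comment is pointing at.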

add system prompt to responses

c90b0d2

SevKod force-pushed the main branch from 4380c29 to c90b0d2 Compare November 6, 2025 17:30

SevKod added 8 commits November 6, 2025 19:05

fix consistency w.r.t first token activations

1703d99

can now see activations for system prompt

cf9d5c5

fix flake8 typo

811dee6

fix recursion for negative indexes

6f06043

add steering to system prompt tokens

7d05d04

fix typos

1117d82

add steering for all system prompt tokens, using str conditions

12631fa

add top_k support, allow the agent to see the inspector, improve multi-inspector capabilities

60c0862

sergioburdisso reviewed Nov 13, 2025

simplify logic

3865a98

sergioburdisso merged commit 45a99d9 into idiap:main Nov 14, 2025
4 checks passed
