
Conversation

@codingliuyg

Thank you for this amazing open-source project!

Problem

Fixes the crash reported in #114, where the system crashes when the prompt length exactly equals kvcache_block_size (256 tokens). Additionally, this PR optimizes scheduler performance by preventing fully cached sequences from entering prefill computation.

Root Cause Analysis

  1. Crash Issue: When seq.num_cached_tokens == len(seq), all tokens are cached, resulting in tokens_to_compute = 0
  2. Empty Tensor Problem: Zero-length tensors are created and passed to CUDA kernels, causing "invalid configuration argument" error
  3. Resource Waste: Fully cached sequences were still being processed in prefill phase, wasting computational resources
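The failure mode above can be reproduced without the real engine. The sketch below is hypothetical (this Sequence stand-in is not the project's actual class) and only shows how a prompt of exactly kvcache_block_size tokens that is fully prefix-cached leaves zero tokens to compute:

```python
# Hypothetical stand-in for the engine's sequence object.
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def __len__(self):
        return len(self.token_ids)


KVCACHE_BLOCK_SIZE = 256  # matches the block size from issue #114

# Prompt length exactly equals the block size, and the whole prompt
# is already present in the prefix cache.
seq = Sequence(list(range(KVCACHE_BLOCK_SIZE)),
               num_cached_tokens=KVCACHE_BLOCK_SIZE)

tokens_to_compute = len(seq) - seq.num_cached_tokens
print(tokens_to_compute)  # 0
# A zero-length input tensor built from this value is what reaches the
# CUDA kernel and triggers "invalid configuration argument".
```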

Solution

1. Scheduler Optimization (scheduler.py)

Before: All sequences were added to scheduled_seqs regardless of cache status

num_seqs += 1
scheduled_seqs.append(seq)
num_batched_tokens += len(seq) - seq.num_cached_tokens

After: Only sequences with tokens_to_compute > 0 are scheduled for computation

tokens_to_compute = len(seq) - seq.num_cached_tokens
if tokens_to_compute > 0:
    num_seqs += 1
    scheduled_seqs.append(seq)
    num_batched_tokens += tokens_to_compute
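To make the guard concrete, here is a self-contained sketch of a scheduling loop with the fix applied. The function name, the budget constant, and the Sequence stand-in are assumptions for illustration, not the project's actual API:

```python
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def __len__(self):
        return len(self.token_ids)


def schedule_prefill(waiting, max_batched_tokens=512):
    """Split waiting sequences into a prefill batch and a decode queue."""
    scheduled_seqs, running = [], []
    num_batched_tokens = 0
    for seq in waiting:
        tokens_to_compute = len(seq) - seq.num_cached_tokens
        if tokens_to_compute > 0:
            if num_batched_tokens + tokens_to_compute > max_batched_tokens:
                break  # token budget exhausted
            scheduled_seqs.append(seq)
            num_batched_tokens += tokens_to_compute
        else:
            # Fully cached: skip prefill entirely and hand the sequence
            # straight to the running queue for the decode phase.
            running.append(seq)
    return scheduled_seqs, running


fully_cached = Sequence(list(range(256)), num_cached_tokens=256)
partial = Sequence(list(range(300)), num_cached_tokens=256)

prefill_batch, decode_queue = schedule_prefill([fully_cached, partial])
print(len(prefill_batch), len(decode_queue))  # 1 1
```

The fully cached sequence never contributes an empty slice to the prefill batch, so no zero-length tensor is ever constructed.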

Benefits:

  • ✅ Prevents empty tensor creation
  • ✅ Eliminates unnecessary CUDA kernel launches
  • ✅ Improves batch efficiency in high cache-hit scenarios

2. Block Manager (block_manager.py)

Before: Assertion failure when last_block.hash != -1 in certain edge cases (e.g. the last block of a fully cached sequence is already hashed)

assert last_block.hash == -1
# ... hash computation code

After: Conditional check to handle edge cases gracefully

if last_block.hash == -1:
    # ... hash computation code

Purpose: This change ensures compatibility with fully cached sequences where the previous hash value is already correct and doesn't need recalculation.
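A minimal sketch of the guarded hash update, assuming a simplified Block class and hash function (the real block_manager.py derives block hashes differently; the names here are illustrative only):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    token_ids: List[int]
    hash: int = -1  # -1 means "not yet hashed"


def compute_hash(token_ids, prefix_hash=-1):
    # Placeholder for the project's real block-hash function.
    return hash((prefix_hash, tuple(token_ids)))


def maybe_hash_last_block(last_block, prefix_hash=-1):
    # Before the fix this was `assert last_block.hash == -1`, which fires
    # for a fully cached sequence whose last block is already hashed.
    if last_block.hash == -1:
        last_block.hash = compute_hash(last_block.token_ids, prefix_hash)
    return last_block.hash


block = Block(token_ids=[1, 2, 3])
h1 = maybe_hash_last_block(block)  # computes and stores the hash
h2 = maybe_hash_last_block(block)  # already hashed: left unchanged
print(h1 == h2)  # True
```

Calling the function twice is idempotent: an already-correct hash is kept rather than recomputed or rejected by an assertion.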

Closes #114

I look forward to your review and would appreciate any feedback.

@GentleCold

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@codingliuyg
Author

codingliuyg commented Nov 3, 2025

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@GentleCold You're absolutely right, and this doesn't conflict with my code. You're referring to the decode phase, and that's exactly what my implementation does: fully cached sequences are put directly into the running queue to be processed in the decode phase.

The key point is that fully cached sequences skip the prefill computation (where tokens_to_compute == 0), but they still get scheduled for decode operations.
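Put another way, the decode-step input for a fully cached sequence is just its final token. A tiny illustrative sketch (hypothetical names, not the engine's actual model-runner code):

```python
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def last_token(self):
        return self.token_ids[-1]


# Every token of the 256-token prompt is cached, so decode only needs
# to feed the last token to the model.
seq = Sequence(list(range(256)), num_cached_tokens=256)
decode_input = [seq.last_token()]
print(decode_input)  # [255]
```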

@GentleCold

I see, thank you for your explanation :)



Successfully merging this pull request may close these issues.

[BUG] Crashes when the prompt length exactly equals kvcache_block_size
