
Conversation

@codingliuyg

Thank you for this amazing open-source project!

Problem

Fixes the crash reported in #114, where the system crashes when the prompt length exactly equals kvcache_block_size (256 tokens). Additionally, this PR optimizes scheduler performance by preventing fully cached sequences from entering prefill computation.

Root Cause Analysis

  1. Crash Issue: When seq.num_cached_tokens == len(seq), all tokens are cached, resulting in tokens_to_compute = 0
  2. Empty Tensor Problem: Zero-length tensors are created and passed to CUDA kernels, causing "invalid configuration argument" error
  3. Resource Waste: Fully cached sequences were still being processed in prefill phase, wasting computational resources
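The failure mode above can be reproduced without the real engine. The sketch below is hypothetical (this Sequence stand-in is not the project's actual class) and only shows how a prompt of exactly kvcache_block_size tokens that is fully prefix-cached leaves zero tokens to compute:

```python
# Hypothetical stand-in for the engine's sequence object.
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def __len__(self):
        return len(self.token_ids)


KVCACHE_BLOCK_SIZE = 256  # matches the block size from issue #114

# Prompt length exactly equals the block size, and the whole prompt
# is already present in the prefix cache.
seq = Sequence(list(range(KVCACHE_BLOCK_SIZE)),
               num_cached_tokens=KVCACHE_BLOCK_SIZE)

tokens_to_compute = len(seq) - seq.num_cached_tokens
print(tokens_to_compute)  # 0
# A zero-length input tensor built from this value is what reaches the
# CUDA kernel and triggers "invalid configuration argument".
```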

Solution

1. Scheduler Optimization (scheduler.py)

Before: All sequences were added to scheduled_seqs regardless of cache status

num_seqs += 1
scheduled_seqs.append(seq)
num_batched_tokens += len(seq) - seq.num_cached_tokens

After: Only sequences with tokens_to_compute > 0 are scheduled for computation

tokens_to_compute = len(seq) - seq.num_cached_tokens
if tokens_to_compute > 0:
    num_seqs += 1
    scheduled_seqs.append(seq)
    num_batched_tokens += tokens_to_compute
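To make the guard concrete, here is a self-contained sketch of a scheduling loop with the fix applied. The function name, the budget constant, and the Sequence stand-in are assumptions for illustration, not the project's actual API:

```python
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def __len__(self):
        return len(self.token_ids)


def schedule_prefill(waiting, max_batched_tokens=512):
    """Split waiting sequences into a prefill batch and a decode queue."""
    scheduled_seqs, running = [], []
    num_batched_tokens = 0
    for seq in waiting:
        tokens_to_compute = len(seq) - seq.num_cached_tokens
        if tokens_to_compute > 0:
            if num_batched_tokens + tokens_to_compute > max_batched_tokens:
                break  # token budget exhausted
            scheduled_seqs.append(seq)
            num_batched_tokens += tokens_to_compute
        else:
            # Fully cached: skip prefill entirely and hand the sequence
            # straight to the running queue for the decode phase.
            running.append(seq)
    return scheduled_seqs, running


fully_cached = Sequence(list(range(256)), num_cached_tokens=256)
partial = Sequence(list(range(300)), num_cached_tokens=256)

prefill_batch, decode_queue = schedule_prefill([fully_cached, partial])
print(len(prefill_batch), len(decode_queue))  # 1 1
```

The fully cached sequence never contributes an empty slice to the prefill batch, so no zero-length tensor is ever constructed.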

Benefits:

  • ✅ Prevents empty tensor creation
  • ✅ Eliminates unnecessary CUDA kernel launches
  • ✅ Improves batch efficiency in high cache-hit scenarios

2. Block Manager (block_manager.py)

Before: Assertion failure when last_block.hash != -1 in certain edge cases (e.g. the last block of a fully cached sequence is already hashed)

assert last_block.hash == -1
# ... hash computation code

After: Conditional check to handle edge cases gracefully

if last_block.hash == -1:
    # ... hash computation code

Purpose: This change ensures compatibility with fully cached sequences where the previous hash value is already correct and doesn't need recalculation.
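A minimal sketch of the guarded hash update, assuming a simplified Block class and hash function (the real block_manager.py derives block hashes differently; the names here are illustrative only):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    token_ids: List[int]
    hash: int = -1  # -1 means "not yet hashed"


def compute_hash(token_ids, prefix_hash=-1):
    # Placeholder for the project's real block-hash function.
    return hash((prefix_hash, tuple(token_ids)))


def maybe_hash_last_block(last_block, prefix_hash=-1):
    # Before the fix this was `assert last_block.hash == -1`, which fires
    # for a fully cached sequence whose last block is already hashed.
    if last_block.hash == -1:
        last_block.hash = compute_hash(last_block.token_ids, prefix_hash)
    return last_block.hash


block = Block(token_ids=[1, 2, 3])
h1 = maybe_hash_last_block(block)  # computes and stores the hash
h2 = maybe_hash_last_block(block)  # already hashed: left unchanged
print(h1 == h2)  # True
```

Calling the function twice is idempotent: an already-correct hash is kept rather than recomputed or rejected by an assertion.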

Closes #114

I look forward to your review and would appreciate any feedback.

@GentleCold

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@codingliuyg
Author

codingliuyg commented Nov 3, 2025

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@GentleCold You're absolutely right, and this doesn't conflict with my code. You're referring to the decode phase, and that's exactly what my implementation does: fully cached sequences are put directly into the running queue to be processed in the decode phase.

The key point is that fully cached sequences skip the prefill computation (where tokens_to_compute == 0), but they still get scheduled for decode operations.
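Put another way, the decode-step input for a fully cached sequence is just its final token. A tiny illustrative sketch (hypothetical names, not the engine's actual model-runner code):

```python
class Sequence:
    def __init__(self, token_ids, num_cached_tokens=0):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def last_token(self):
        return self.token_ids[-1]


# Every token of the 256-token prompt is cached, so decode only needs
# to feed the last token to the model.
seq = Sequence(list(range(256)), num_cached_tokens=256)
decode_input = [seq.last_token()]
print(decode_input)  # [255]
```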

@GentleCold

I see, thank you for your explanation :)



Successfully merging this pull request may close these issues.

[BUG] Crashes when the prompt length exactly equals kvcache_block_size
