Summary
This issue tracks two related improvements in FlashInfer-Bench:
- Reduce Python-side apply() overhead so it’s negligible compared to kernel runtime.
- Improve the Adapter API so it’s easier to use.
Motivation
1. apply() overhead
- apply() overhead makes it harder to trust end-to-end latency numbers for very fast solutions.
- The Python orchestration cost around apply() is currently ~2% on Llama 3.1 8B, and can be further reduced (see the measurement sketch below).
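A minimal sketch of how the orchestration overhead could be isolated from kernel runtime; `measure_apply_overhead` and the two callables are illustrative helpers, not part of FlashInfer-Bench:

```python
import time
from typing import Callable, Dict

def measure_apply_overhead(run_e2e: Callable[[], None],
                           run_kernel_only: Callable[[], None],
                           iters: int = 100) -> Dict[str, float]:
    """Estimate Python-side orchestration overhead by comparing an
    end-to-end call (dispatch + kernel) against a kernel-only call.

    For GPU kernels, the callables should synchronize internally
    (e.g. torch.cuda.synchronize()) so wall-clock times are meaningful.
    """
    def avg_time(fn: Callable[[], None]) -> float:
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    e2e = avg_time(run_e2e)
    kernel = avg_time(run_kernel_only)
    overhead = e2e - kernel
    return {
        "e2e_s": e2e,
        "kernel_s": kernel,
        "overhead_s": overhead,
        "overhead_pct": 100.0 * overhead / e2e if e2e else 0.0,
    }
```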
2. Adapter usability
- Writing a new Adapter currently requires understanding several internal concepts, such as the dispatch workflow.
- We’d like a smoother path (sketched after this list) for:
  - Adding a new adapter.
  - Configuring existing adapters.
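A rough sketch of what that smoother path could look like; all names here (Adapter, register_adapter, get_adapter, the config dict) are hypothetical and do not reflect the current FlashInfer-Bench API:

```python
# Hypothetical sketch only: Adapter, register_adapter, and get_adapter are
# illustrative names, not the actual FlashInfer-Bench API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

_ADAPTER_REGISTRY: Dict[str, "Adapter"] = {}

@dataclass
class Adapter:
    name: str
    apply_fn: Callable[..., Any]                           # wraps the solution's kernel call
    config: Dict[str, Any] = field(default_factory=dict)   # user-tunable knobs

def register_adapter(adapter: Adapter) -> None:
    """Make an adapter discoverable by name, hiding the dispatch workflow."""
    _ADAPTER_REGISTRY[adapter.name] = adapter

def get_adapter(name: str) -> Adapter:
    return _ADAPTER_REGISTRY[name]

# Adding a new adapter would be a single call:
register_adapter(Adapter(name="my_attention", apply_fn=lambda *a, **kw: None))

# Configuring an existing adapter would be a plain dict update:
get_adapter("my_attention").config.update({"num_warmup_iters": 3})
```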