Revise 'apply' documentation for FlashInfer integration (#86)
Updated the documentation for the 'apply' feature in FlashInfer,
enhancing clarity and detail on usage, runtime substitution, and custom
integration patterns.
return F.linear(A, B) # fallback/reference or a simple baseline
210
-
```
211
-
212
-
**Turn on runtime substitution:**
201
+
With `apply`, we can dynamically replace the kernels in the FlashInfer API with the best-performing ones from our traces. With adapters already written for FlashInfer, you can enable integration with minimal code changes.
213
202
214
203
```bash
215
204
export FIB_ENABLE_APPLY=1
205
+
export FIB_DATASET_PATH=/path/to/flashinfer-trace
216
206
python serve_or_benchmark.py
217
207
```
218
208
219
-
At call time, `apply` looks up the **Definition** (by name or via the lambda), matches the current **workload** (axes +, when required, data properties), and dispatches to the **best** `Solution` according to your recorded **Traces** (with correctness constraints and numeric tolerances enforced).
209
+
At call time, `apply` looks up the Definition, matches the current workload (axes and input data properties), and dispatches to the best Solution according to our Traces (with correctness constraints and numeric tolerances enforced).
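The selection step described above can be sketched in a few lines. This is a minimal illustration, not FlashInfer-Bench's implementation: `ToyTrace`, `axes_key`, and `pick_best` are hypothetical names, the trace schema is invented, and "best" is assumed here to mean the lowest recorded latency among traces that passed the correctness check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToyTrace:
    """One recorded benchmark result (hypothetical schema, for illustration only)."""
    definition: str
    axes: Tuple[Tuple[str, int], ...]  # sorted (axis name, value) pairs
    solution: str
    latency_ms: float
    passed: bool  # True if the run met the correctness tolerances


def axes_key(axes: Dict[str, int]) -> Tuple[Tuple[str, int], ...]:
    # Normalize a workload's axes so that dict ordering does not matter.
    return tuple(sorted(axes.items()))


def pick_best(traces: List[ToyTrace], definition: str, axes: Dict[str, int]) -> Optional[str]:
    # Keep only traces for this definition and workload that passed correctness.
    key = axes_key(axes)
    candidates = [
        t for t in traces
        if t.definition == definition and t.axes == key and t.passed
    ]
    if not candidates:
        return None  # the caller would fall back to the original kernel
    # Assumption: "best" means lowest recorded latency.
    return min(candidates, key=lambda t: t.latency_ms).solution


traces = [
    ToyTrace("gemm", axes_key({"M": 128, "N": 4096}), "triton_v1", 0.42, True),
    ToyTrace("gemm", axes_key({"M": 128, "N": 4096}), "cuda_v2", 0.31, True),
    # Fastest, but it failed the tolerance check, so it must be ignored.
    ToyTrace("gemm", axes_key({"M": 128, "N": 4096}), "cuda_fast_math", 0.25, False),
]

best = pick_best(traces, "gemm", {"N": 4096, "M": 128})
missing = pick_best(traces, "gemm", {"M": 1, "N": 4096})
```

In this sketch a workload with no passing trace yields `None`, which stands in for the fallback path.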
220
210
221
-
### Advanced Usage: Supporting kernels that don’t align with the Definition
211
+
### Using adapters to support kernels that don't align with the Definition
222
212
223
-
Sometimes your production call site can’t be decorated directly—e.g., wrappers that keep internal state across `plan()`/`run()` like `BatchPrefillWithPagedKVCacheWrapper`. In these cases the function you call at runtime doesn’t match the kernel definition’s flat signature, so the decorator can’t attach cleanly. Use the imperative form instead.
213
+
Sometimes your production call site can't be decorated directly—e.g., wrappers that keep internal state across `plan()`/`run()` like `BatchPrefillWithPagedKVCacheWrapper`. FlashInfer-Bench provides built-in adapters for common FlashInfer kernels, and you can also use the imperative `apply()` API for custom integration patterns.
Use the function form of `apply` anywhere you call the kernel. It will (1) in **apply** mode: look up the best Solution for the current workload and call it; (2) in **tracing** mode: record the workload, then run the fallback; (3) otherwise: just call the fallback.
217
+
FlashInfer-Bench automatically patches common FlashInfer kernels when you enable apply. No manual decoration needed:
228
218
229
-
```python
230
-
import flashinfer
219
+
**How it works:** When you call `enable_apply()`, FlashInfer-Bench automatically installs lightweight adapters that:
220
+
1. Intercept FlashInfer wrapper methods (`plan` and `run`)
221
+
2. Extract runtime parameters and match them to definitions
222
+
3. Dispatch to the best-performing solution from your traces
223
+
4. Fall back to the original FlashInfer implementation if no suitable solution exists
231
224
232
-
result = flashinfer.apply(
233
-
name: Union[str, Callable[..., str]],
234
-
fallback_function: Callable[..., Any],
235
-
*args,  # All arguments must follow the kernel definition's interface
See `flashinfer_bench/integration/flashinfer/adapters/` for the complete list and implementation details.
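To make the four "How it works" steps concrete, here is a toy sketch of the interception pattern. It does not reproduce the built-in adapters: `ToyPagedWrapper`, `install_toy_adapter`, and `toy_lookup` are hypothetical, and only `run` is patched because this toy `plan` already stores the state the lookup needs.

```python
import functools
from typing import Callable, Optional


class ToyPagedWrapper:
    """Stand-in for a stateful wrapper whose plan() state is consumed by run()."""

    def plan(self, page_size: int) -> None:
        self._page_size = page_size

    def run(self, q: str) -> str:
        return f"original(q={q}, page_size={self._page_size})"


def install_toy_adapter(cls, lookup: Callable[[int], Optional[Callable[[str, int], str]]]) -> None:
    # Step 1: intercept the wrapper method, keeping a handle on the original.
    original_run = cls.run

    @functools.wraps(original_run)
    def patched_run(self, q: str) -> str:
        # Step 2: extract runtime parameters saved by plan() and match them.
        solution = lookup(self._page_size)
        if solution is None:
            # Step 4: no suitable solution, so use the original implementation.
            return original_run(self, q)
        # Step 3: dispatch to the selected solution.
        return solution(q, self._page_size)

    cls.run = patched_run


def toy_lookup(page_size: int) -> Optional[Callable[[str, int], str]]:
    # Hypothetical trace database: only page_size == 16 has a tuned solution.
    if page_size == 16:
        return lambda q, ps: f"tuned(q={q}, page_size={ps})"
    return None


install_toy_adapter(ToyPagedWrapper, toy_lookup)

w = ToyPagedWrapper()
w.plan(page_size=16)
tuned_result = w.run("q0")
w.plan(page_size=64)
fallback_result = w.run("q0")
```

Because the patch wraps the class method rather than the call site, existing code that calls `plan()` and `run()` stays unchanged, which is the property the built-in adapters are described as providing.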
241
232
242
-
In this example, the FlashInfer attention wrapper carries state from `plan()` into `run()`, while the FlashInfer-Bench definition exposes a single `attention(init_params, plan_params, run_params)` entry point. Bridge them with a small monkey-patch that reconstructs the original flow as the fallback:
233
+
#### Imperative `apply()` API (Custom Integration)
243
234
244
-
```python
245
-
# Original wrapper shape (state lives across plan/run)
runtime_kwargs: Dict[str, Any],  # All arguments must follow the kernel definition's interface
243
+
fallback: Optional[Callable[..., Any]] = None,
244
+
)
291
245
```
292
246
293
-
This preserves wrapper state while letting **apply** choose the best solution (and still trace workloads when enabled).
294
-
295
-
#### Alternative: avoid monkey-patching (shim inside the class)
296
-
297
-
If you can edit the wrapper, define a tiny adapter that flattens `(init, plan, run)` into the definition’s signature and call `flashinfer.apply(...)` directly inside `run()`. Same behavior, fewer moving parts.
298
-
299
-
#### Scope & tips
247
+
**Parameters:**
248
+
- `def_name_or_resolver`: The kernel definition name (e.g., `"gemm_bf16"`) or a resolver function that maps runtime arguments to a definition name.
249
+
- `runtime_kwargs`: Dictionary of keyword arguments to pass to the selected kernel. Must match the kernel definition's interface.
250
+
- `fallback`: Optional fallback function to invoke when no matching kernel is found in the Trace database.
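The way these three parameters interact can be sketched with a stand-in function. `toy_apply`, `TOY_SOLUTIONS`, `resolve_gemm_name`, and `reference_gemm` are hypothetical and exist only for this illustration. Two behaviors are assumptions rather than documented facts: that the resolver receives the runtime arguments as keyword arguments, and that a miss with no fallback raises an error.

```python
from typing import Any, Callable, Dict, Optional, Union

# Hypothetical stand-in for the Trace database: definition name -> tuned kernel.
TOY_SOLUTIONS: Dict[str, Callable[..., Any]] = {
    "gemm_n2": lambda A, B: "tuned gemm_n2",
}


def toy_apply(
    def_name_or_resolver: Union[str, Callable[..., str]],
    runtime_kwargs: Dict[str, Any],
    fallback: Optional[Callable[..., Any]] = None,
) -> Any:
    # Resolve the definition name, either directly or from the runtime arguments.
    if callable(def_name_or_resolver):
        name = def_name_or_resolver(**runtime_kwargs)  # assumed calling convention
    else:
        name = def_name_or_resolver

    solution = TOY_SOLUTIONS.get(name)
    if solution is not None:
        return solution(**runtime_kwargs)
    if fallback is None:
        # Assumption: the real behavior without a fallback may differ.
        raise LookupError(f"no solution for {name!r} and no fallback given")
    return fallback(**runtime_kwargs)


def reference_gemm(A, B):
    # Plain-Python matrix multiply used as the fallback path.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def resolve_gemm_name(A, B) -> str:
    # Derive the definition name from the output width of B.
    return f"gemm_n{len(B[0])}"


hit = toy_apply(resolve_gemm_name, {"A": [[1, 2]], "B": [[1, 0], [0, 1]]}, fallback=reference_gemm)
miss = toy_apply("gemm_n3", {"A": [[1, 2]], "B": [[1, 0, 2], [0, 1, 3]]}, fallback=reference_gemm)
```

Passing a resolver instead of a fixed name lets one call site serve several definitions that differ only by shape, while the fallback keeps the call correct when no trace covers the workload.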
300
251
301
-
* Make sure your adapter/fallback **matches the Definition I/O** exactly.
302
-
* Group `init_params`, `plan_params`, and `run_params` so they cover the definition’s required tensors (e.g., `Q, K, V, page_size, page_indptr`).
303
-
* When definitions vary by shape, pass a **`name` lambda** (e.g., derive hidden size from weights) to resolve the correct Definition at call time.
252
+
#### Example: Creating custom adapters (advanced)
304
253
305
-
## Related customization you can enable
254
+
If you want to create reusable adapters similar to the built-in FlashInfer integrations, study the real implementations: