# Bring Your Own Kernel to FlashInfer-Bench
This guide explains how to add Definitions and Solutions, capture Workloads, and record Evaluations by walking through each **component of the Trace**, with an end-to-end "apply at runtime" flow.
A **Trace** is an atomic, immutable record of a single benchmark run. It links a specific `Solution` to a specific `Definition`, fixes the exact `workload` (input shapes + input data), and stores the complete `evaluation`. A folder of Definitions, Solutions, and Traces is your benchmark database.
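To make this concrete, here is a minimal sketch of a Trace record as a Python dict. The four top-level keys follow the schema below; the nested values are illustrative placeholders, not the exact field layout.

```python
# Hypothetical Trace record (field contents are illustrative placeholders):
trace = {
    "definition": "gemm_prefill_example",  # name of the Definition being benchmarked
    "solution": "gemm_prefill_triton_v1",  # name of the Solution under test
    "workload": {"axes": {"m": 4096}},     # concrete shapes + input data for this run
    "evaluation": {"status": "PASSED"},    # complete result bundle for this run
}
```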
## Trace Schema (top level)
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `workload` | object | Yes | Concrete shapes and input data used for this run. |

More details about the schema are in [FlashInfer Trace Schema](https://bench.flashinfer.ai/docs/flashinfer_trace/flashinfer_trace).
## Component 1: `definition`
**What it is:** The operator’s contract: axes (const/var), inputs/outputs, constraints, and a correct (not necessarily fast) `reference`.
**Identity rule:** Two kernels belong to the same Definition iff:
* They have the same axes,
* Each axis has the same role (`const` vs `var`),
* All `const` axes have the same values (see the example below).
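For instance, a sketch with hypothetical axis names (not taken from any shipped Definition):

```python
# Kernels A and B: same axes, same roles, same const values -> same Definition.
kernel_a_axes = {
    "head_dim": {"type": "const", "value": 128},
    "batch_size": {"type": "var"},
}
kernel_b_axes = dict(kernel_a_axes)

# Kernel C: same axes and roles, but a different const value -> a separate Definition.
kernel_c_axes = {
    "head_dim": {"type": "const", "value": 64},
    "batch_size": {"type": "var"},
}
```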
**How to add a new kernel Definition:**
1. Refer to the schema, choose a `name` (`<type>_<stage>_<axis tokens>`) and `type`; write a clear `description` and helpful `tags`.
2. Specify `axes` with `type: const|var` (plus `value` for const), as in the sketch below.
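A minimal sketch of a Definition record, using only the fields named above; the names and values are illustrative:

```python
# Hypothetical Definition fragment (illustrative names and values):
definition = {
    "name": "gemm_prefill_example",
    "type": "gemm",
    "description": "Example GEMM definition (illustrative only).",
    "tags": ["example"],
    "axes": {
        "head_dim": {"type": "const", "value": 128},  # fixed for this Definition
        "batch_size": {"type": "var"},                # varies per workload
    },
}
```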
## Component 2: `solution`
**What it is:** A concrete implementation of a Definition’s interface (Triton, CUDA, CUTLASS, PyTorch, etc.) plus metadata including target architectures, libraries, and author (human or LLM).
**Interface:** Your function must take the Definition’s `inputs` and return the tuple of `outputs`.
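As a sketch, assuming a Definition whose inputs are two matrices and whose single output is their product (the function and argument names are illustrative):

```python
import torch

def gemm_example(a: torch.Tensor, b: torch.Tensor) -> tuple[torch.Tensor]:
    """Toy Solution: parameters mirror the Definition's `inputs`, in order."""
    out = a @ b    # a correct (not necessarily fast) implementation
    return (out,)  # always return the tuple of `outputs`
```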
**How to add a Solution:**
1. Add the kernel implementation (matching the Definition’s signature).
2. Provide metadata co-located with the code, according to the schema.
3. Add unit tests against `reference` across representative shapes (a sketch follows this list).
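A minimal unit-test sketch, assuming the toy `gemm_example` above and a `reference` implemented in plain PyTorch (both hypothetical):

```python
import torch

def reference(a: torch.Tensor, b: torch.Tensor) -> tuple[torch.Tensor]:
    return (a @ b,)

def test_gemm_example_matches_reference():
    for m, k, n in [(1, 64, 64), (128, 256, 512)]:  # representative shapes
        a, b = torch.randn(m, k), torch.randn(k, n)
        (got,), (want,) = gemm_example(a, b), reference(a, b)
        torch.testing.assert_close(got, want)
```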
## Component 3: `workload`
**What it is:** The concrete axes + input data that instantiate a Definition for one run.

**How to capture Workloads:**

1. Set the dataset path (optional):

```bash
export FIB_DATASET_PATH=/path/to/flashinfer-trace  # defaults to `~/.cache/flashinfer_bench/dataset` if unset
```
2. Enable tracing and run your engine or script:
```bash
export FIB_ENABLE_TRACING=1
python run_engine.py  # your serving or batch script
```
By default, all kernels that have a built-in [tracing config](https://github.com/flashinfer-ai/flashinfer-bench/blob/main/flashinfer_bench/tracing/builtin/configs.py) and a matching Definition are traced.
3. What gets saved & where (default layout):
```
$FIB_DATASET_PATH/
├── workloads/
│   └── <op_type>/
│       └── <definition_name>.jsonl   # workload records (FlashInfer Trace format)
└── blob/
    └── workloads/                    # tensor payloads (safetensors, when dumped)
```
Writing tensors to file is async (background thread) to reduce runtime overhead.
### Tracing in code (fine-grained control)
If you want to target only a subset of kernels or customize tracing policies:
```python
import flashinfer_bench as fib

# 1) Pick which kernels to trace and how
from flashinfer_bench import TracingConfig

gqa_paged_prefill_config = TracingConfig(
    input_dump_policy="dump_non_float",  # keep scalar and int tensors; skip large float payloads
    filter_policy="shape_only",          # save first occurrence per input-shape signature
)

# 2) Enable tracing while your engine or script runs
# (the `tracing_configs` value is truncated in this excerpt)
with fib.enable_tracing(dataset_path="/root/flashinfer-trace", tracing_configs=...):
    ...
```
**Policies you can use right away:**
* `input_dump_policy`: `"dump_all"`, `"dump_none"`, `"dump_int32"`, or a list of input names to dump, e.g. `input_dump_policy=["qo_indptr", "kv_indptr", "kv_indices", "sm_scale"]`.
* `filter_policy`: `"keep_all"`, `"keep_first"` (e.g., first k calls), `"keep_first_by_axes"`, `"keep_none"`, or a custom callable `Workload -> key` (sketched below). These reduce disk/time while keeping representative samples.
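For the custom callable form, a sketch under the assumption that workloads returning the same key are deduplicated; the `axes` attribute here is illustrative, not a confirmed field name:

```python
def one_per_batch_size(workload) -> str:
    """Hypothetical `Workload -> key` filter: keep one sample per batch size."""
    return str(workload.axes.get("batch_size"))  # assumed field; adjust to the real Workload layout

my_config = TracingConfig(filter_policy=one_per_batch_size)
```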
## Component 4: `evaluation`
**What it is:** The result bundle for one `(definition, solution, workload)` run.
**How to benchmark to produce Evaluations:**

Run the benchmarker over your `(definition, solution, workload)` triples in the dataset:
Using the CLI:
```bash
flashinfer-bench run --local /path/to/flashinfer-trace
```
Using the Python API:
```python
from flashinfer_bench.data import TraceSet
from flashinfer_bench.bench import Benchmark

# 1) Load the dataset
trace_set = TraceSet(root="./flashinfer-trace")  # scans for definitions, solutions, workloads

# 2) Run the benchmark (`config` construction is elided in this excerpt)
benchmark = Benchmark(trace_set, config)
benchmark.run_all(save_results=True)
```
* **Device pool:** One `MultiProcessRunner` is created per CUDA device.
* **Concurrency:** For each definition and workload, the benchmark:
  * Picks up to `K = min(#devices, #solutions)` runners (round-robin).
  * **Reference phase:** in parallel, calls `runner.run_ref(defn, wl, config)` to build a baseline on each selected runner.
  * If a runner fails during reference, it is removed from the pool and the workload on that runner is skipped.
  * **Solutions phase:** distributes solutions round-robin across the runners that succeeded in the reference phase, calling `runner.run_solution(sol, baseline_handle, config)` in parallel.
* **Status mapping:**
  * A successful run with numerics within tolerance → `PASSED`.