| 1 | +# Microarchitecture Benchmark Suite |
| 2 | + |
| 3 | +This directory contains a suite of microbenchmarks designed to measure CPU microarchitectural properties, with a focus on instruction frontend (fetch/decode) behavior, cache hierarchies, and branch prediction characteristics. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +These benchmarks help characterize processor behavior by executing synthetic workloads and collecting performance counter data via Linux perf. The suite supports multiple architectures (x86-64, ARM64). One microbenchmark is specific to NVIDIA GPUs |
| 8 | +and measures performance counters for kernel launches. |
| 9 | + |
| 10 | +## Microbenchmarks |
| 11 | + |
| 12 | +### 1. frontend_study |
| 13 | + |
| 14 | +**File:** `frontend_study.c` |
| 15 | + |
| 16 | +**Purpose:** Measure instruction frontend performance and cache behavior with configurable code layout. |
| 17 | + |
| 18 | +**What it measures:** |
| 19 | +- iTLB misses (Instruction Translation Lookaside Buffer) |
| 20 | +- L1I cache misses (Level 1 Instruction cache) |
| 21 | +- Branch prediction misses |
| 22 | +- L2 cache loads |
| 23 | +- Cycles and instructions executed |
| 24 | + |
| 25 | +**Key features:** |
| 26 | +- Creates multiple dynamically allocated memory regions containing function copies |
| 27 | +- Functions are filled with architecture-specific NOPs (4-byte instructions) |
| 28 | +- Adjustable function sizes: 16, 64, 256, 1024, 4096, or 8192 NOPs |
| 29 | +- Two access patterns: |
| 30 | + - **Sequential**: Calls functions in order modulo divisor |
| 31 | + - **Random**: Uses PRNG for random function selection |
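
The two access patterns can be illustrated with a short Python sketch. It is not taken from `frontend_study.c`: the LCG constants, the seed, and the helper names `lcg_next` and `call_order` are assumptions made for illustration, and the real `my_rand()` in `utils.c` may use different parameters.

```python
# Hypothetical LCG parameters (the common Numerical Recipes constants).
# The constants used by my_rand() in utils.c are not documented here.
LCG_A = 1664525
LCG_C = 1013904223
LCG_M = 2 ** 32


def lcg_next(state):
    """Advance the linear congruential generator by one step."""
    return (LCG_A * state + LCG_C) % LCG_M


def call_order(iterations, divisor, random_jumps, seed=1):
    """Return the function index chosen on each iteration.

    random_jumps=0 walks the functions in order, modulo divisor.
    random_jumps=1 picks an index from the deterministic PRNG.
    """
    order = []
    state = seed
    for i in range(iterations):
        if random_jumps:
            state = lcg_next(state)
            # Use the high bits: the low bits of an LCG with a
            # power-of-two modulus repeat with a short period.
            order.append((state >> 16) % divisor)
        else:
            order.append(i % divisor)
    return order


print(call_order(8, 3, random_jumps=0))  # [0, 1, 2, 0, 1, 2, 0, 1]
print(call_order(8, 3, random_jumps=1))
```

Because the generator is deterministic, two runs with the same seed visit the same functions in the same order, which keeps random-mode measurements repeatable.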
| 32 | + |
| 33 | +**Command-line parameters:** |
| 34 | +``` |
| 35 | +-d <divisor> : Number of different functions to cycle through |
| 36 | +-i <iterations> : Total iterations of measurement loop |
| 37 | +-b <buffer_size_MB> : Size of allocated memory regions (MB) |
| 38 | +-n <num_buffers> : Number of separate memory regions |
| 39 | +-s <page_KB> : Page size for function alignment (KB) |
| 40 | +-f <func_nops> : Function size in NOPs (16/64/256/1024/4096/8192) |
| 41 | +-r <random_jumps> : 0=sequential, 1=random function selection |
| 42 | +``` |
| 43 | + |
| 44 | +**Example usage:** |
| 45 | +```bash |
| 46 | +./frontend_study -d 10 -i 1000000 -b 32 -n 256 -s 64 -f 1024 -r 0 |
| 47 | +``` |
| 48 | + |
| 49 | +### 2. instr_throughput |
| 50 | + |
| 51 | +**File:** `instr_throughput.c` |
| 52 | + |
| 53 | +**Purpose:** Measure instruction fetch and decode throughput across varying code size ranges. |
| 54 | + |
| 55 | +**What it measures:** |
| 56 | +- Cycles and instructions for code execution |
| 57 | +- L1I cache misses |
| 58 | +- iTLB misses |
| 59 | +- L2 cache loads |
| 60 | +- DRAM reads (LL cache misses) |
| 61 | +- Bytes per cycle (throughput metric) |
| 62 | + |
| 63 | +**Key features:** |
| 64 | +- Tests 19 different code sizes from 1KB to 1MB |
| 65 | +- Dynamically generates executable code buffers filled with NOP instructions |
| 66 | +- Per-iteration metric collection |
| 67 | +- Auto-scales iterations based on code size (larger code = fewer iterations) |
| 68 | + |
| 69 | +**Typical test sizes:** 1K, 4K, 8K, 16K, 32K, 64K, 128K, 192K, 256K, 512K, 1M |
| 70 | + |
| 71 | +**Output:** Throughput metrics per iteration for each code size |
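
As a rough illustration of the throughput metric and the iteration scaling described above, the following Python sketch shows one plausible way to compute them. The formula (bytes executed divided by cycles), the byte budget, and the function names are assumptions; `instr_throughput.c` may compute these differently.

```python
def bytes_per_cycle(code_bytes, passes, cycles):
    """Bytes of code executed per cycle over `passes` runs of the buffer."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return code_bytes * passes / cycles


def scaled_iterations(code_bytes, byte_budget=1 << 30, minimum=1):
    """Pick an iteration count so every code size executes roughly the
    same number of bytes (hypothetical budget: 1 GiB)."""
    return max(minimum, byte_budget // code_bytes)


# A 32 KiB buffer executed 1000 times in 8,192,000 cycles.
print(bytes_per_cycle(32 * 1024, 1000, 8_192_000))  # 4.0

# Larger code gets fewer iterations.
print(scaled_iterations(1024), scaled_iterations(1 << 20))
```

Holding the total number of executed bytes roughly constant keeps run time similar across code sizes, so the small sizes still run often enough for stable counter readings.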
| 72 | + |
| 73 | +### 3. btb_estimate |
| 74 | + |
| 75 | +**File:** `btb_estimate.c` |
| 76 | + |
| 77 | +**Purpose:** Estimate Branch Target Buffer (BTB) capacity and behavior by measuring branch prediction performance. |
| 78 | + |
| 79 | +**What it measures:** |
| 80 | +- Branch misses |
| 81 | +- Instructions executed |
| 82 | +- Misses per instruction, per iteration, per buffer entry |
| 83 | + |
| 84 | +**Key features:** |
| 85 | +- Tests with buffer sizes from 256 to 262,144 entries (13 sizes) |
| 86 | +- Buffer contains random 0s and 1s |
| 87 | +- Based on buffer value, executes 1023 or 1024 NOPs (single-bit difference) |
| 88 | +- Tracks when branch prediction fails |
| 89 | +- Helps determine BTB capacity on the processor |
| 90 | + |
| 91 | +**Logic:** A sharp rise in branch misses at larger buffer sizes indicates that BTB capacity has been exceeded. |
| 92 | + |
| 93 | +### 4. btb_estimate_calls |
| 94 | + |
| 95 | +**File:** `btb_estimate_calls.c` |
| 96 | + |
| 97 | +**Purpose:** Estimate BTB capacity using indirect function calls instead of conditional branches. |
| 98 | + |
| 99 | +**What it measures:** |
| 100 | +- Branch misses from call instructions |
| 101 | +- Instructions executed |
| 102 | +- Misses per iteration and per function pointer |
| 103 | + |
| 104 | +**Key features:** |
| 105 | +- Allocates 128 to 4096 function pointers |
| 106 | +- Each function pointer stored on a separate 64KB page |
| 107 | +- All functions are identical (1024 NOPs + return) |
| 108 | +- Measures branch prediction misses when calling through these pointers |
| 109 | +- Random offsets within pages prevent trivial prediction |
| 110 | + |
| 111 | +**Pages tested:** 128 to 4096 pages (21 sizes) |
| 112 | + |
| 113 | +**Output:** Branch miss metrics as function count increases (identifies BTB saturation point) |
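
One way to read the saturation point off the output is to look for the first function count at which misses per call cross a threshold. The Python sketch below assumes the results are available as `(function_count, misses_per_call)` pairs. The threshold of 0.5 and the sample values are synthetic placeholders for illustration, not measurements from any processor.

```python
def saturation_bracket(samples, threshold=0.5):
    """Return (last_ok, first_saturated) function counts.

    samples: iterable of (function_count, misses_per_call) pairs.
    The BTB capacity estimate lies between the two returned counts.
    first_saturated is None if no sample crosses the threshold.
    """
    last_ok = None
    for count, misses_per_call in sorted(samples):
        if misses_per_call > threshold:
            return last_ok, count
        last_ok = count
    return last_ok, None


# Synthetic placeholder data, NOT real measurements.
synthetic = [
    (128, 0.01), (256, 0.01), (512, 0.02),
    (1024, 0.03), (2048, 0.60), (4096, 0.95),
]
print(saturation_bracket(synthetic))
```

A fixed threshold is the simplest choice. On real data the transition may be gradual, so inspecting the full curve is safer than relying on a single cutoff.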
| 114 | + |
| 115 | +### 5. fe_study_cuda |
| 116 | + |
| 117 | +**File:** `fe_study_cuda.cu` |
| 118 | + |
| 119 | +**Purpose:** Study instruction frontend behavior on NVIDIA GPUs. |
| 120 | + |
| 121 | +**What it measures:** |
| 122 | +- GPU cycles and instructions |
| 123 | +- L1I cache misses on GPU |
| 124 | +- Function-specific overhead measurement |
| 125 | +- GPU instruction issue patterns |
| 126 | + |
| 127 | +**Key features:** |
| 128 | +- CUDA kernel compilation (sm_90 for x86-64, sm_100 for ARM) |
| 129 | +- NOP-based synthetic workloads |
| 130 | +- Profiles flush overhead for L1I cache |
| 131 | + |
| 132 | +**Target architecture:** NVIDIA GPUs |
| 133 | + |
| 134 | +## Supporting Files |
| 135 | + |
| 136 | +### utils.c / utils.h |
| 137 | + |
| 138 | +Provides common functionality for all benchmarks: |
| 139 | +- **Perf counter abstraction**: Wraps Linux `perf_event_open` syscall |
| 140 | +- **Counter sets**: iTLB, L1I, Branch, L2, DRAM reads |
| 141 | +- **CPU frequency detection**: Reads from `/sys/devices/system/cpu/` or `/proc/cpuinfo` |
| 142 | +- **Measurement results aggregation**: Structures for storing multi-counter data |
| 143 | +- **LCG random number generator**: Deterministic RNG (`my_rand()`) |
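
The frequency detection fallback can be sketched in Python as follows. This is an illustration, not the code in `utils.c`: the specific sysfs file (`cpu0/cpufreq/scaling_cur_freq`, reported in kHz) and the function names are assumptions, since the list above only names the parent directory and `/proc/cpuinfo`.

```python
import re
from pathlib import Path

# Assumed sysfs file; the value is in kHz when the file exists.
SYSFS_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
CPUINFO = Path("/proc/cpuinfo")


def parse_cpuinfo_mhz(text):
    """Extract the first 'cpu MHz' value from /proc/cpuinfo text, or None."""
    match = re.search(r"^cpu MHz\s*:\s*([0-9.]+)", text, re.MULTILINE)
    return float(match.group(1)) if match else None


def cpu_frequency_hz():
    """Try sysfs first, then fall back to /proc/cpuinfo. None if both fail."""
    try:
        return int(SYSFS_FREQ.read_text().strip()) * 1000
    except (OSError, ValueError):
        pass
    try:
        mhz = parse_cpuinfo_mhz(CPUINFO.read_text())
    except OSError:
        return None
    return int(mhz * 1_000_000) if mhz is not None else None


print(parse_cpuinfo_mhz("processor\t: 0\ncpu MHz\t\t: 2400.000\n"))  # 2400.0
print(cpu_frequency_hz())
```

Trying sysfs first matters on ARM64, where `/proc/cpuinfo` often has no `cpu MHz` line, so the second source alone would not be enough.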
| 144 | + |
| 145 | +### run_benchmark.sh |
| 146 | + |
| 147 | +Test harness for batch execution: |
| 148 | +- Reads input configuration from a file |
| 149 | +- Executes benchmarks with multiple parameter combinations |
| 150 | +- Outputs results in CSV format |
| 151 | + |
| 152 | +### full_run_input.txt |
| 153 | + |
| 154 | +Sample configuration parameters for `frontend_study`: |
| 155 | +- Tests varying divisors (16-512), iterations (10M-100M) |
| 156 | +- Memory configurations: 1-512MB buffers with 64KB pages |
| 157 | +- Function sizes: 16-8192 NOPs |
| 158 | + |
| 159 | +**Example row:** `10000000,100000000,32,256,64,16,0` |
| 160 | +- Divisor=10M, Iterations=100M, Buffer=32MB, Buffers=256, Page=64KB, Function=16 NOPs, Random=0 (sequential) |
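
To make the field order concrete, here is a small Python sketch that turns one configuration row into a `frontend_study` command line. The field-to-flag mapping follows the example row above. That `run_benchmark.sh` builds its command this way is an assumption, and `row_to_argv` is a hypothetical helper, not part of the suite.

```python
import shlex

# Field order from the example row: divisor, iterations, buffer MB,
# number of buffers, page KB, function NOPs, random flag.
FLAGS = ["-d", "-i", "-b", "-n", "-s", "-f", "-r"]


def row_to_argv(row, binary="./frontend_study"):
    """Convert one comma-separated config row into an argument list."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) != len(FLAGS):
        raise ValueError(f"expected {len(FLAGS)} fields, got {len(fields)}")
    argv = [binary]
    for flag, value in zip(FLAGS, fields):
        int(value)  # raises ValueError on non-numeric input
        argv += [flag, value]
    return argv


print(shlex.join(row_to_argv("10000000,100000000,32,256,64,16,0")))
```

Validating the field count up front catches rows that were truncated or that use a different delimiter before any benchmark is launched.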
| 161 | + |
| 162 | +## Building |
| 163 | + |
| 164 | +```bash |
| 165 | +make |
| 166 | +``` |
| 167 | + |
| 168 | +The Makefile creates these executables: |
| 169 | +1. `frontend_study` - CPU frontend study (gcc) |
| 170 | +2. `fe_study_cuda` - CUDA version (nvcc) |
| 171 | +3. `instr_throughput` - Throughput benchmark |
| 172 | +4. `btb_estimate` - BTB sizing (branches) |
| 173 | +5. `btb_estimate_calls` - BTB sizing (calls) |
| 174 | + |
| 175 | +Build configuration: |
| 176 | +- Uses `gcc` for C benchmarks with `-O2` optimization |
| 177 | +- Uses `nvcc` for CUDA benchmarks with architecture-specific targets |
| 178 | +- Supports both x86-64 and ARM64 architectures |
| 179 | + |
| 180 | +## Usage Examples |
| 181 | + |
| 182 | +**Individual benchmark:** |
| 183 | +```bash |
| 184 | +./frontend_study -d 10 -i 1000000 -b 32 -n 256 -s 64 -f 1024 -r 0 |
| 185 | +``` |
| 186 | + |
| 187 | +**Batch execution with script:** |
| 188 | +```bash |
| 189 | +./run_benchmark.sh full_run_input.txt > results.csv |
| 190 | +``` |
| 191 | + |
| 192 | +**Run instruction throughput tests:** |
| 193 | +```bash |
| 194 | +./instr_throughput |
| 195 | +``` |
| 196 | + |
| 197 | +**Estimate BTB capacity:** |
| 198 | +```bash |
| 199 | +./btb_estimate |
| 200 | +./btb_estimate_calls |
| 201 | +``` |
| 202 | + |
| 203 | +## Measurement Focus Areas |
| 204 | + |
| 205 | +| Benchmark | Primary Focus | Architecture | Key Metrics | |
| 206 | +|-----------|--------------|--------------|-------------| |
| 207 | +| frontend_study | Frontend/cache behavior | x86-64, ARM64 | iTLB, L1I, L2, Branch misses | |
| 208 | +| instr_throughput | Code size scaling | x86-64, ARM64 | Throughput, cache misses | |
| 209 | +| btb_estimate | BTB capacity (branches) | x86-64, ARM64 | Branch misses vs buffer size | |
| 210 | +| btb_estimate_calls | BTB capacity (calls) | x86-64, ARM64 | Call misses vs function count | |
| 211 | +| fe_study_cuda | GPU frontend | CUDA | GPU-specific patterns | |
| 212 | + |
| 213 | +## Performance Counter Requirements |
| 214 | + |
| 215 | +These benchmarks require access to Linux perf events. You may need to adjust the `perf_event_paranoid` setting: |
| 216 | + |
| 217 | +```bash |
| 218 | +# Check current setting |
| 219 | +cat /proc/sys/kernel/perf_event_paranoid |
| 220 | + |
| 221 | +# Allow user access to performance counters (requires root privileges) |
| 222 | +echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid |
| 223 | +``` |
| 224 | + |
| 225 | +## Use Cases |
| 226 | + |
| 227 | +This benchmark suite is useful for: |
| 228 | +- Characterizing CPU microarchitecture behavior |
| 229 | +- Identifying performance bottlenecks in instruction fetch and decode |
| 230 | +- Understanding cache hierarchy characteristics |
| 231 | +- Measuring branch prediction capabilities and BTB capacity |
| 232 | +- Comparing performance across different processor generations |
| 233 | +- GPU instruction frontend analysis |