
Commit 31ae95c

valentinandrei authored and meta-codesync[bot] committed
Microbenchmark battery to analyze a CPU's front-end performance (#343)
Summary:
Pull Request resolved: #343

This microbenchmark battery aims to study the CPU's front-end performance. A few notable features:
- We can identify branch target buffer length
- We can identify call prediction buffer length
- We can measure instruction fetch throughput when code sits at different cache levels
- We can measure the behavior of the CPU in the context of many iTLB misses

Reviewed By: excelle08

Differential Revision: D89094585

fbshipit-source-id: 2baa94e3a9beff446f83a769a3898ba9bb2b2b23
1 parent e158ee1 commit 31ae95c


11 files changed, +2239 -0 lines changed


uarch_bench/Makefile

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
CC = gcc
CFLAGS = -O3 -Wall -Wextra

CUDA_ROOT_DIR=/usr/local/cuda-12.8
CUDA_ARCH=sm_100

CUDA_LIB_DIR=-L$(CUDA_ROOT_DIR)/lib/cuda-12.8
CUDA_INC_DIR=-I$(CUDA_ROOT_DIR)/include
NVCC = $(CUDA_ROOT_DIR)/bin/nvcc
NVCCFLAGS = -O3 -arch=$(CUDA_ARCH) $(CUDA_INC_DIR) $(CUDA_LIB_DIR) -Xcompiler "-Wall -Wextra"

TARGET = frontend_study
CUDA_TARGET = fe_study_cuda
THROUGHPUT_TARGET = instr_throughput
BTB_TARGET = btb_estimate
BTB_CALLS_TARGET = btb_estimate_calls
OBJS = utils.o

all: $(TARGET) $(CUDA_TARGET) $(THROUGHPUT_TARGET) $(BTB_TARGET) $(BTB_CALLS_TARGET)

$(TARGET): frontend_study.c $(OBJS)
	$(CC) $(CFLAGS) -o $(TARGET) frontend_study.c $(OBJS)

$(CUDA_TARGET): fe_study_cuda.cu $(OBJS)
	$(NVCC) $(NVCCFLAGS) -o $(CUDA_TARGET) fe_study_cuda.cu $(OBJS)

$(THROUGHPUT_TARGET): instr_throughput.c $(OBJS)
	$(CC) $(CFLAGS) -o $(THROUGHPUT_TARGET) instr_throughput.c $(OBJS)

$(BTB_TARGET): btb_estimate.c $(OBJS)
	$(CC) $(CFLAGS) -o $(BTB_TARGET) btb_estimate.c $(OBJS)

$(BTB_CALLS_TARGET): btb_estimate_calls.c $(OBJS)
	$(CC) $(CFLAGS) -o $(BTB_CALLS_TARGET) btb_estimate_calls.c $(OBJS)

utils.o: utils.c utils.h
	$(CC) $(CFLAGS) -c utils.c

clean:
	rm -f $(TARGET) $(CUDA_TARGET) $(THROUGHPUT_TARGET) $(BTB_TARGET) $(BTB_CALLS_TARGET) $(THROUGHPUT_DYNAMIC_TARGET) $(OBJS)

.PHONY: all clean

uarch_bench/README.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
# Microarchitecture Benchmark Suite

This directory contains a suite of microbenchmarks designed to measure CPU microarchitectural properties, with a focus on instruction frontend (fetch/decode) behavior, cache hierarchies, and branch prediction characteristics.

## Overview

These benchmarks help characterize processor behavior by executing synthetic workloads and collecting performance counter data via Linux perf. The suite supports multiple architectures (x86-64, ARM64). One microbenchmark is specific to NVIDIA GPUs
and measures performance counters for kernel launches.

## Microbenchmarks

### 1. frontend_study

**File:** `frontend_study.c`

**Purpose:** Measure instruction frontend performance and cache behavior with configurable code layout.

**What it measures:**
- iTLB misses (Instruction Translation Lookaside Buffer)
- L1I cache misses (Level 1 Instruction cache)
- Branch prediction misses
- L2 cache loads
- Cycles and instructions executed

**Key features:**
- Creates multiple dynamically allocated memory regions containing function copies
- Functions are filled with architecture-specific NOPs (4-byte instructions)
- Adjustable function sizes: 16, 64, 256, 1024, 4096, 8192 NOPs
- Two access patterns:
  - **Sequential**: Calls functions in order modulo divisor
  - **Random**: Uses PRNG for random function selection

**Command-line parameters:**
```
-d <divisor>         : Number of different functions to cycle through
-i <iterations>      : Total iterations of measurement loop
-b <buffer_size_MB>  : Size of allocated memory regions (MB)
-n <num_buffers>     : Number of separate memory regions
-s <page_KB>         : Page size for function alignment (KB)
-f <func_nops>       : Function size in NOPs (16/64/256/1024/4096/8192)
-r <random_jumps>    : 0=sequential, 1=random function selection
```

**Example usage:**
```bash
./frontend_study -d 10 -i 1000000 -b 32 -n 256 -s 64 -f 1024 -r 0
```
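
The sequential and random access patterns selected by `-r` can be sketched as below. This is a minimal illustration and not the code in `frontend_study.c`: the names `lcg_next` and `pick_function` are hypothetical, and the LCG constants are a commonly used pair that may differ from what `my_rand()` uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LCG step. The multiplier and increment are a widely used
   pair, assumed here for illustration only. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Index of the function to call on iteration `iter`.
   random_jumps == 0: sequential, i.e. iter modulo divisor.
   random_jumps != 0: pseudo-random index in [0, divisor).
   `divisor` must be non-zero. */
size_t pick_function(uint64_t iter, size_t divisor, int random_jumps,
                     uint32_t *rng_state)
{
    if (random_jumps)
        return (size_t)(lcg_next(rng_state) % divisor);
    return (size_t)(iter % divisor);
}
```

With `-d 10 -r 0`, iteration 13 would map to function 3 under this sketch. A deterministic generator keeps the random pattern reproducible between runs.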

### 2. instr_throughput

**File:** `instr_throughput.c`

**Purpose:** Measure instruction fetch and decode throughput across varying code size ranges.

**What it measures:**
- Cycles and instructions for code execution
- L1I cache misses
- iTLB misses
- L2 cache loads
- DRAM reads (LL cache misses)
- Bytes per cycle (throughput metric)

**Key features:**
- Tests 19 different code sizes from 1KB to 1MB
- Dynamically generates executable code buffers filled with NOP instructions
- Per-iteration metric collection
- Auto-scales iterations based on code size (larger code = fewer iterations)

**Typical test sizes:** 1K, 4K, 8K, 16K, 32K, 64K, 128K, 192K, 256K, 512K, 1M

**Output:** Throughput metrics per iteration for each code size
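
One plausible way to derive the bytes-per-cycle metric is to divide the code bytes fetched by the measured cycles. The helper below is a sketch under that assumption; `bytes_per_cycle` and its parameters are hypothetical names, and `instr_throughput.c` may compute the metric differently.

```c
#include <stdint.h>

/* Fetch throughput in bytes per cycle, assuming the code buffer of
   `code_bytes` bytes was executed `passes` times in `cycles` cycles.
   Returns 0.0 when no cycles were recorded. */
double bytes_per_cycle(uint64_t code_bytes, uint64_t passes, uint64_t cycles)
{
    if (cycles == 0)
        return 0.0;
    return (double)code_bytes * (double)passes / (double)cycles;
}
```

For example, a 4096-byte buffer executed 10 times in 8192 cycles gives 5.0 bytes per cycle.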

### 3. btb_estimate

**File:** `btb_estimate.c`

**Purpose:** Estimate Branch Target Buffer (BTB) capacity and behavior by measuring branch prediction performance.

**What it measures:**
- Branch misses
- Instructions executed
- Misses per instruction, per iteration, per buffer entry

**Key features:**
- Tests with buffer sizes from 256 to 262,144 entries (13 sizes)
- Buffer contains random 0s and 1s
- Based on buffer value, executes 1023 or 1024 NOPs (single-bit difference)
- Tracks when branch prediction fails
- Helps determine BTB capacity on the processor

**Logic:** Larger buffer sizes that cause more misses indicate exceeding BTB capacity.
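
The logic above can be turned into a simple post-processing step: scan the results in increasing buffer size and report the last size before the miss rate jumps. The function below is a hypothetical heuristic written for illustration; `estimate_capacity`, the jump `factor`, and `floor_rate` are assumptions and are not part of `btb_estimate.c`.

```c
#include <stddef.h>

/* Given sizes in increasing order and the misses per entry measured for
   each, return the last size before the miss rate grows by more than
   `factor` relative to the previous size. `floor_rate` guards against
   near-zero baselines. Returns the largest size if no jump is found,
   and 0 if there are no samples. */
size_t estimate_capacity(const size_t *sizes, const double *misses_per_entry,
                         size_t n, double factor, double floor_rate)
{
    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        double prev = misses_per_entry[i - 1];
        if (prev < floor_rate)
            prev = floor_rate;
        if (misses_per_entry[i] > factor * prev)
            return sizes[i - 1];
    }
    return sizes[n - 1];
}
```

The threshold is a judgment call; on noisy data it is worth inspecting the full curve instead of trusting a single cut-off.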

### 4. btb_estimate_calls

**File:** `btb_estimate_calls.c`

**Purpose:** Estimate BTB capacity using indirect function calls instead of conditional branches.

**What it measures:**
- Branch misses from call instructions
- Instructions executed
- Misses per iteration and per function pointer

**Key features:**
- Allocates 128 to 4096 function pointers
- Each function pointer stored on a separate 64KB page
- All functions are identical (1024 NOPs + return)
- Measures branch prediction misses when calling through these pointers
- Random offsets within pages prevent trivial prediction

**Pages tested:** 128 to 4096 pages (21 sizes)

**Output:** Branch miss metrics as function count increases (identifies BTB saturation point)
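
Placing each function at a random offset inside its own page could look like the sketch below. It is an illustration only: `slot_offset` and the alignment parameter are hypothetical, and the 4100-byte function size in the example assumes 4-byte NOPs plus a 4-byte return, which may not match every target.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset, from the start of the mapping, at which to place the
   function for page `page_index`. `rnd` is any pseudo-random value.
   The offset is a multiple of `align` within the page and keeps the
   whole function inside that page. Returns (size_t)-1 if the function
   cannot fit or `align` is zero. */
size_t slot_offset(size_t page_index, size_t page_bytes,
                   size_t func_bytes, uint32_t rnd, size_t align)
{
    if (align == 0 || func_bytes > page_bytes)
        return (size_t)-1;
    /* Number of aligned start positions that still fit the function. */
    size_t slots = (page_bytes - func_bytes) / align + 1;
    size_t within = (size_t)(rnd % slots) * align;
    return page_index * page_bytes + within;
}
```

With 64KB pages, a 4100-byte function, and 64-byte alignment, there are 960 candidate start positions per page.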

### 5. fe_study_cuda

**File:** `fe_study_cuda.cu`

**Purpose:** Study instruction frontend behavior on NVIDIA GPUs.

**What it measures:**
- GPU cycles and instructions
- L1I cache misses on GPU
- Function-specific overhead measurement
- GPU instruction issue patterns

**Features:**
- CUDA kernel compilation (sm_90 for x86-64, sm_100 for ARM)
- NOP-based synthetic workloads
- Profiles flush overhead for L1I cache

**Target architecture:** NVIDIA GPUs

## Supporting Files

### utils.c / utils.h

Provides common functionality for all benchmarks:
- **Perf counter abstraction**: Wraps Linux `perf_event_open` syscall
- **Counter sets**: iTLB, L1I, Branch, L2, DRAM reads
- **CPU frequency detection**: Reads from `/sys/devices/system/cpu/` or `/proc/cpuinfo`
- **Measurement results aggregation**: Structures for storing multi-counter data
- **LCG random number generator**: Deterministic RNG (`my_rand()`)
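
For readers unfamiliar with `perf_event_open`, a wrapper of the kind listed above might look like the following. This is a generic sketch of the Linux interface, not the code in `utils.c`: the names `open_hw_counter` and `count_events` are hypothetical, and the real abstraction may group counters or use different attributes.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a user-space-only hardware counter (for example
   PERF_COUNT_HW_BRANCH_MISSES) for the calling thread on any CPU.
   Returns a file descriptor, or -1 on failure with errno set. */
int open_hw_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-space events only */
    attr.exclude_hv = 1;
    /* glibc has no wrapper for this syscall, so call it directly. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count events while work(arg) runs. Returns 0 on success, -1 on error. */
int count_events(int fd, void (*work)(void *), void *arg, uint64_t *out)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1)
        return -1;
    if (ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == -1)
        return -1;
    work(arg);
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == -1)
        return -1;
    return read(fd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

A caller would open the counter once, wrap each measured region in `count_events`, and `close` the descriptor when done. Opening can fail on systems where perf access is restricted, so the return value should always be checked.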

### run_benchmark.sh

Test harness for batch execution:
- Reads input configuration from a file
- Executes benchmarks with multiple parameter combinations
- Outputs results in CSV format

### full_run_input.txt

Sample configuration parameters for `frontend_study`:
- Tests varying divisors (16-512), iterations (10M-100M)
- Memory configurations: 1-512MB buffers with 64KB pages
- Function sizes: 16-8192 NOPs

**Example row:** `10000000,100000000,32,256,64,16,0`
- Divisor=10M, Iterations=100M, Buffer=32MB, Buffers=256, Page=64KB, Function=16 NOPs, No random
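
The mapping from a configuration row to a `frontend_study` command line can be sketched in C as below. It assumes the seven comma-separated fields follow the order of the example row (divisor, iterations, buffer MB, buffer count, page KB, function NOPs, random flag); `struct run_config`, `parse_row`, and `format_command` are hypothetical names, and `run_benchmark.sh` itself is a shell script that may parse rows differently.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the configuration file, in the assumed field order. */
struct run_config {
    unsigned long divisor;
    unsigned long iterations;
    unsigned buffer_mb;
    unsigned num_buffers;
    unsigned page_kb;
    unsigned func_nops;
    int random_jumps;
};

/* Parse "d,i,b,n,s,f,r". Returns 0 on success, -1 on a malformed row. */
int parse_row(const char *row, struct run_config *c)
{
    int n = sscanf(row, "%lu,%lu,%u,%u,%u,%u,%d",
                   &c->divisor, &c->iterations, &c->buffer_mb,
                   &c->num_buffers, &c->page_kb, &c->func_nops,
                   &c->random_jumps);
    return n == 7 ? 0 : -1;
}

/* Write the matching command line into buf. Returns 0 on success,
   -1 if the buffer is too small. */
int format_command(const struct run_config *c, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "./frontend_study -d %lu -i %lu -b %u -n %u -s %u -f %u -r %d",
                     c->divisor, c->iterations, c->buffer_mb,
                     c->num_buffers, c->page_kb, c->func_nops,
                     c->random_jumps);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Rejecting rows with fewer than seven fields keeps a truncated configuration file from silently running with uninitialized parameters.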

## Building

```bash
make
```

The Makefile creates these executables:
1. `frontend_study` - CPU frontend study (gcc)
2. `fe_study_cuda` - CUDA version (nvcc)
3. `instr_throughput` - Throughput benchmark
4. `btb_estimate` - BTB sizing (branches)
5. `btb_estimate_calls` - BTB sizing (calls)

Build configuration:
- Uses `gcc` for C benchmarks with `-O3` optimization
- Uses `nvcc` for CUDA benchmarks with architecture-specific targets
- Supports both x86-64 and ARM64 architectures

## Usage Examples

**Individual benchmark:**
```bash
./frontend_study -d 10 -i 1000000 -b 32 -n 256 -s 64 -f 1024 -r 0
```

**Batch execution with script:**
```bash
./run_benchmark.sh full_run_input.txt > results.csv
```

**Run instruction throughput tests:**
```bash
./instr_throughput
```

**Estimate BTB capacity:**
```bash
./btb_estimate
./btb_estimate_calls
```

## Measurement Focus Areas

| Benchmark | Primary Focus | Architecture | Key Metrics |
|-----------|--------------|--------------|-------------|
| frontend_study | Frontend/cache behavior | x86-64, ARM64 | iTLB, L1I, L2, Branch misses |
| instr_throughput | Code size scaling | x86-64, ARM64 | Throughput, cache misses |
| btb_estimate | BTB capacity (branches) | x86-64, ARM64 | Branch misses vs buffer size |
| btb_estimate_calls | BTB capacity (calls) | x86-64, ARM64 | Call misses vs function count |
| fe_study_cuda | GPU frontend | CUDA | GPU-specific patterns |

## Performance Counter Requirements

These benchmarks require access to Linux perf events. You may need to adjust the `perf_event_paranoid` setting:

```bash
# Check current setting
cat /proc/sys/kernel/perf_event_paranoid

# Allow user access to performance counters (may require sudo)
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
```
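
A benchmark could also read this setting at startup and print a clear message before attempting to open counters. The helper below is a minimal sketch of that idea; `read_perf_paranoid` is a hypothetical name and is not part of `utils.c`.

```c
#include <stdio.h>

/* Read the integer stored in a sysctl-style file such as
   /proc/sys/kernel/perf_event_paranoid. Returns 0 and stores the value
   in *level on success, -1 if the file cannot be opened or parsed. */
int read_perf_paranoid(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = fscanf(f, "%d", level) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

Taking the path as a parameter keeps the helper easy to test against a temporary file.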

## Use Cases

This benchmark suite is useful for:
- Characterizing CPU microarchitecture behavior
- Identifying performance bottlenecks in instruction fetch and decode
- Understanding cache hierarchy characteristics
- Measuring branch prediction capabilities and BTB capacity
- Comparing performance across different processor generations
- GPU instruction frontend analysis
