
Asm\x86\sort.asm - possible speedup #166

@necros2k7

Description

  1. Hot loops (loop_down_1, opt_loop, main_loop_sort) execute millions of times; aligning each one to a uop-cache boundary helps the front end fetch them cleanly on every iteration. Put the sort code in its own 64-byte-aligned segment:

```asm
Z7_SORT_ASM_USE_SEGMENT equ 1
_TEXT$Z7_SORT SEGMENT ALIGN(64) 'CODE'
```

and at each inner loop label:

```asm
align 64
```
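
For concreteness, here is a minimal assemblable sketch (ml64 syntax) of how the two directives could be wired together; the DemoLoop wrapper and its body are placeholders, while Z7_SORT_ASM_USE_SEGMENT, the segment name, and the main_loop_sort label come from the points above:

```asm
; Sketch only - not the real Sort.asm body.
Z7_SORT_ASM_USE_SEGMENT equ 1

ifdef Z7_SORT_ASM_USE_SEGMENT
_TEXT$Z7_SORT SEGMENT ALIGN(64) 'CODE'
endif

DemoLoop PROC                      ; hypothetical stand-in for the sort routine
        mov     ecx, 1000
        align 64                   ; loop entry lands on a fresh 64-byte line
main_loop_sort:                    ; hot-loop label named in point 1
        dec     ecx
        jnz     main_loop_sort
        ret
DemoLoop ENDP

ifdef Z7_SORT_ASM_USE_SEGMENT
_TEXT$Z7_SORT ENDS
endif
        END
```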

  2. Activate the NUM_PREFETCH_LEVELS logic:

```asm
NUM_PREFETCH_LEVELS equ 4 ; prefetch two 64-byte cache lines
```

Place the prefetch macro in the critical loops (MOVE_SMALLEST_UP, main_loop_sort), switching from the default byte ptr [p+offs+cur_offset] touch to an unconditional prefetcht0. This helps hide the L1 miss penalty on large datasets.
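
Below is one way the unconditional prefetch could be expressed; the PREFETCH_NODE macro name, the r11 scratch register, and the 4-byte element size are assumptions, while prefetcht0 and NUM_PREFETCH_LEVELS come from the proposal above:

```asm
NUM_PREFETCH_LEVELS equ 4

; Hypothetical macro: instead of touching the future node with a plain
; "byte ptr [p + offs + cur_offset]" load, issue a real prefetch for the
; cache line the heap walk will reach NUM_PREFETCH_LEVELS levels later.
PREFETCH_NODE macro base:req, index:req
        mov     r11, index                   ; current node index (scratch reg assumed free)
        shl     r11, NUM_PREFETCH_LEVELS     ; descendant that many levels down is ~index * 2^levels
        prefetcht0 byte ptr [base + r11*4]   ; 4-byte heap elements assumed
endm

; usage inside the hot loop (e.g. MOVE_SMALLEST_UP / main_loop_sort):
;       PREFETCH_NODE rdx, rax
```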

  3. Some macros load a value into t0 only to store it straight back. Replace

```asm
LOAD t0, s
STORE t0, k
```

with a single transfer from [p + 4*s] to [p + 4*k] when registers allow. Since x86 has no memory-to-memory mov, in practice this means keeping the value live in a register and dropping the redundant reload, which cuts the µops and memory ops of the pair in half.
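
A before/after sketch under the assumption that LOAD/STORE expand to 4-byte moves through t0; the registers and expansions shown are guesses, not the actual Sort.asm definitions:

```asm
; Assumed expansions:  LOAD t0, s  ->  mov t0, dword ptr [p + s*4]
;                      STORE t0, k ->  mov dword ptr [p + k*4], t0
; x86 has no memory-to-memory mov, so the win comes from paths where the
; value is already live in t0 from an earlier compare.

; before: redundant reload followed by the store (2 memory ops)
        mov     eax, dword ptr [rdx + rsi*4]    ; LOAD t0, s   (eax = t0, assumed)
        mov     dword ptr [rdx + rdi*4], eax    ; STORE t0, k

; after: the reload is dropped, only the store remains (1 memory op)
        mov     dword ptr [rdx + rdi*4], eax    ; value kept live from the earlier LOAD
```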
