aarch64: use read_unaligned for vld1_*
#2004
base: main
Conversation
I think it would be nice to add a comment to the `.yaml` documenting the rationale and linking to this PR.
I'm happy just checking for LD; that seems enough for the quick optimisation sanity check that the instruction assertion tests are there for. We have behavioural tests for these, and which variant of load ends up used in the test shims (which just have one intrinsic called, not really a realistic example) isn't too important.

As far as I know these intrinsics support unaligned reads, so `read_unaligned` is correct. The patch makes sense on its own, regardless of the bug, as part of #1659.

I haven't dug much into the missed optimisation you've shown, but the ref intrinsic example seems to optimise much better if you avoid dereferencing the reference and just
Force-pushed: fd709ac to 6365646
Sure, that would never create the intermediate

There would be generic optimizations for
Gotcha, yeah, that is weird. Here's a simpler example in C that does the exact same thing with the vector tuple, but not with the single-vector result. To me this matches your suggestion that there's something special about this load intrinsic that the optimiser can't reason about.
Correction: we don't have behavioural tests for many of these, I think for the `f16` and `_xN` variants.
They aren't specifically in the

Especially for aarch64_be, having some tests is probably a good idea.
The test tool can't handle impure functions, so it filters out anything with a pointer in the signature. Looks like we didn't write manual tests for these when adding them.

I spoke to some Arm LLVM engineers. Sounds like they can probably eliminate the copy; is there an LLVM bug open for this?

One opinion was that in C, if someone used an ld1x4 intrinsic, they would generally expect the ld1x4 instruction variant coming out. The compiler probably doesn't generate them by default (probably favouring `ldp`), so they haven't really done much with them, assuming that the user probably knows something LLVM doesn't. With this patch there'd be no way to generate that instruction. This reasoning probably extends to other intrinsics too, as they're designed to give a high level of control rather than optimise well, and those two concerns overlap here.
I'm not aware of any related LLVM issue.
I'll add some tests then, especially for |
Force-pushed: 6365646 to 4ec72fb
I've rebased on the tests, and it all works (even on aarch64_be), so this should be ready.
The custom intrinsics for `vld1_*` optimize less well than a standard unaligned read: https://rust.godbolt.org/z/8T6Kr63K4

That seems like something that should be fixed in LLVM; it should be able to eliminate this store to an alloca. But for now we can fix it here.

The only problem is how to test it. Maybe there is some clever way in the `yml` format, but the issue is that some vector sizes use `ldp` and others use `ldr`. I don't currently see a way to encode that nicely. I have it on good authority (from an Arm engineer) that there is no reason to prefer `ld1` over two `ldp`s.

This was found in fearless_simd: @Shnatsel and I went on a bug hunt, and finally found the cause of their weird codegen in a read that first dereferenced a reference to an array. There is some extra context here: linebender/fearless_simd#185 (comment).

cc @adamgemmell @CrooseGit if I'm missing anything, and if we can find a proper way to test this (or just accept testing for `ld` and not specifying the exact instruction).

edit: also I can't find anywhere whether the intrinsic assumes an aligned pointer or not, so maybe we should be using `core::ptr::read` instead?