aarch64: use read_unaligned for vld1_*
#2004
base: main
Conversation
I think it would be nice to add a comment to the `.yaml` documenting the rationale and linking to this PR.
I'm happy just checking for LD; that seems enough for the quick optimisation sanity check that the instruction assertion tests are there for. We have behavioural tests for these, and which variant of load ends up used in the test shims (which just have one intrinsic called, not really a realistic example) isn't too important.

As far as I know these intrinsics support unaligned reads, so `read_unaligned` is correct. The patch makes sense on its own, regardless of the bug, as part of #1659.

I haven't dug much into the missed optimisation you've shown, but the ref intrinsic example seems to optimise much better if you avoid dereferencing the reference and just
Force-pushed: fd709ac to 6365646
Sure, that would never create the intermediate

There would be generic optimizations for
Gotcha, yeah, that is weird. Here's a simpler example in C that does the exact same thing with the vector tuple, but not with the single-vector result. To me this matches your suggestion that there's something special about this load intrinsic that the optimiser can't reason about.
Correction: we don't have behavioural tests for many of these, I think for the `f16` and `_xN` variants.
They aren't specifically in the

Especially for aarch64_be, having some tests is probably a good idea.
The test tool can't handle impure functions, so it filters out anything with a pointer in the signature. Looks like we didn't write manual tests for these when adding them.

I spoke to some Arm LLVM engineers. Sounds like they can probably eliminate the copy; is there an LLVM bug open for this?

One opinion was that in C, if someone used an ld1x4 intrinsic, they would generally expect the ld1x4 instruction variant coming out. The compiler probably doesn't generate them by default (probably favouring `ldp`), so they haven't really done much with them, assuming that the user probably knows something LLVM doesn't. With this patch there'd be no way to generate that instruction. This reasoning probably extends to other intrinsics too, as they're designed to give a high level of control rather than optimise well, and those two concerns overlap here.
I'm not aware of any related LLVM issue.
I'll add some tests then, especially for |
Force-pushed: 6365646 to 4ec72fb
I've rebased on the tests, and it all works (even on aarch64_be), so this should be ready.
The custom intrinsics for `vld1_*` optimize less well than a standard unaligned read: https://rust.godbolt.org/z/8T6Kr63K4

That seems like something that should be fixed in LLVM; it should be able to eliminate this store to an alloca. But for now we can fix it here.

The only problem is how to test it. Maybe there is some clever way in the `yml` format, but the issue is that some vector sizes use `ldp` and others use `ldr`. I don't currently see a way to encode that nicely. I have it on good authority (from an Arm engineer) that there is no reason to prefer `ld1` over two `ldp`s.

This was found in fearless_simd: @Shnatsel and I went on a bug hunt, and finally found the cause of their weird codegen in a read that first dereferenced a reference to an array. There is some extra context here: linebender/fearless_simd#185 (comment).

cc @adamgemmell @CrooseGit if I'm missing anything, and if we can find a proper way to test this (or just accept testing for `ld` and not specifying the exact instruction).

edit: also I can't find anywhere whether the intrinsic assumes an aligned pointer or not, so maybe we should be using `core::ptr::read` instead?