bluss added 24 commits December 28, 2020 20:58
Add fast test switch so we can skip big matrices in cross tests
Add github actions to replace travis
This binary makes it easy to run custom benchmarks of bigger matrices
with custom size, layout and threading.
Add benchmark runner as an "example" binary
For std and threading we can use the thread_local!() macro, but for
no-std we'll need to use a stack array instead.
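A minimal sketch of the std side of this idea: a per-thread scratch buffer behind `thread_local!()`, reused across calls instead of being reallocated. The names (`MASK_BUFFER`, `with_mask_buffer`, `kernel_stub`) are illustrative, not the crate's actual items; under no-std the commit uses a stack array instead.

```rust
use std::cell::RefCell;

const MASK_LEN: usize = 32;

thread_local! {
    // One reusable buffer per thread; no allocation on the hot path.
    static MASK_BUFFER: RefCell<[u8; MASK_LEN]> = RefCell::new([0; MASK_LEN]);
}

fn with_mask_buffer<R>(f: impl FnOnce(&mut [u8; MASK_LEN]) -> R) -> R {
    MASK_BUFFER.with(|buf| f(&mut buf.borrow_mut()))
}

fn kernel_stub() -> u8 {
    with_mask_buffer(|mask| {
        mask[0] = 0xff; // a kernel would write its partial-column mask here
        mask[0]
    })
}
```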
When threads pay off depends a lot on hardware, so even having a
heuristic is risky, but we'll add one and can improve it later.
On non-x86, this macro can be unused.
These are used all the time for profiling (and only affect development,
so they might as well be enabled).
This is a performance fix, using one Lazy/OnceCell instead of two
separate ones saves a little time - it's just a few ns - which was
visible in the benchmark for (too) small matrices.
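A sketch of the single-cell idea: cache both detection results in one `OnceLock` so the fast path pays one atomic load instead of two. The type and field names here are illustrative stand-ins, not the crate's actual items.

```rust
use std::sync::OnceLock;

// Both results of runtime feature detection live in one cell.
struct Detected {
    kernel: u8,
    masked_kernel: u8,
}

static DETECTED: OnceLock<Detected> = OnceLock::new();

fn detected() -> &'static Detected {
    // Initialized once; every later call is a single cheap load.
    DETECTED.get_or_init(|| Detected { kernel: 1, masked_kernel: 2 })
}
```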
Use repr(align(x)) so we don't have to oversize and manually align the mask
buffer. Also use an UnsafeCell to remove (the very small, few ns)
overhead of borrowing the RefCell. (Its borrowing was pointless anyway,
since we held the raw pointer much longer than RefCell "borrow".)
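A sketch of the buffer change described above: `repr(align)` gives the array its alignment directly, so there is no need to oversize it and align by hand, and `UnsafeCell` hands out a raw pointer without `RefCell`'s borrow bookkeeping. `MaskBuffer` here is a simplified stand-in for the crate's actual type.

```rust
use std::cell::UnsafeCell;

// The attribute guarantees 32-byte alignment of the whole struct.
#[repr(align(32))]
struct MaskBuffer(UnsafeCell<[u8; 32]>);

fn buffer_is_aligned() -> bool {
    let buf = MaskBuffer(UnsafeCell::new([0u8; 32]));
    // UnsafeCell::get returns a raw pointer with no runtime borrow check.
    (buf.0.get() as usize) % 32 == 0
}
```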
bluss and others added 30 commits April 30, 2023 12:26
Use a different pack function for complex microkernels, which puts real and
imaginary parts in separate rows. This enables much better autovectorization
for the fallback kernels.
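A sketch of the split packing: interleaved `(re, im)` pairs are unpacked into one contiguous run of reals followed by one of imaginaries, which gives the fallback kernels straight-line loops over each component. The exact layout and function name are illustrative, not the crate's actual packing format.

```rust
// Unpack interleaved complex data [re0, im0, re1, im1, ...] into
// [re0, re1, ..., im0, im1, ...] so each component is contiguous.
fn pack_split(interleaved: &[f32], out: &mut [f32]) {
    let n = interleaved.len() / 2;
    for i in 0..n {
        out[i] = interleaved[2 * i];         // real parts first
        out[n + i] = interleaved[2 * i + 1]; // imaginary parts after
    }
}
```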
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does
better than Fma here, so both can be worthwhile.
Because we detect target features to select kernel, and the kernel
can select its own packing functions, we can now specialize the packing
functions per target.

As matrices get larger, the packing performance matters much less, but
for small matrix products it contributes more to the runtime.

The default packing also already has a special case for contiguous
matrices, which happens when in C = A B, A is column major and B is row
major. The specialization in this commit helps the most outside this
special case.
avx2, fma and f32::mul_add together are a success in autovectorization,
while just fma with f32::mul_add is not (!).

For this reason, only call f32::mul_add when we opt in to it.
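A sketch of gating the fused call behind an opt-in, as the commit describes. The feature name `use-mul-add` is illustrative, not the crate's actual flag.

```rust
// Only use the fused multiply-add when explicitly opted in; otherwise
// emit plain ops, which LLVM autovectorizes more freely without fma.
#[inline]
fn fused(a: f32, b: f32, c: f32) -> f32 {
    if cfg!(feature = "use-mul-add") {
        a.mul_add(b, c) // single fused multiply-add, one rounding step
    } else {
        a * b + c // separate multiply and add
    }
}
```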
Remove flags that miri now enables by default.
cgemm was not tested as no-std in CI.
For Bazel compatibility. Fixes #78
In the layout benchmarks, which are used to check packing and kernel
sensitivity to memory layout, test some non-contiguous layouts.
A user showed that in certain configurations on macos, the TLS allocation can
even be 8-byte aligned.
Cargo cross no longer supports this old Rust version; increase the cross
version used.
Completely distrust repr(align()) on macos and always manually ensure basic
alignment.
We requested 32-byte alignment for s390x, but thread-local storage does
not supply it. Lower the requested alignment to 16 in general to avoid
having this problem pop up on other platforms too.
Windows 7 does not respect 16-byte alignment for thread locals on i686
builds. Rust 1.79 changed i686 Windows builds to use native thread-local
support. As a result, using `matrixmultiply` on i686 win7 builds leads
to a UB-check panic (nounwind) when run on Windows 7. See
<rust-lang/rust#138903> for more info.

This change adds `i686-win7-windows-msvc` as an excluded target for the
alignment attribute on `MaskBuffer`.
- Removed redundant `_mm256_permute2f128_ps` instructions for lane swapping.
- Consolidated `bv_lh` usage for upper and lower halves, reducing the number of separate permutes.
- Reordered final output assignments to match the expected layout directly, simplifying downstream processing.
- This change reduces register pressure and improves instruction efficiency without altering the computation logic.