bluss added 24 commits December 28, 2020 20:58
Add fast test switch so we can skip big matrices in cross tests
Add github actions to replace travis
This binary makes it easy to run custom benchmarks of bigger matrices
with custom size, layout and threading.
Add benchmark runner as an "example" binary
For std and threading we can use the thread_local!() macro, but for
no-std we'll need to use a stack array instead.
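A minimal sketch of the std side of this idea: a per-thread scratch buffer behind `thread_local!()`, reused across calls instead of being reallocated. The names (`MASK_BUFFER`, `with_mask_buffer`, `kernel_stub`) are illustrative, not the crate's actual items; under no-std the commit uses a stack array instead.

```rust
use std::cell::RefCell;

const MASK_LEN: usize = 32;

thread_local! {
    // One reusable buffer per thread; no allocation on the hot path.
    static MASK_BUFFER: RefCell<[u8; MASK_LEN]> = RefCell::new([0; MASK_LEN]);
}

fn with_mask_buffer<R>(f: impl FnOnce(&mut [u8; MASK_LEN]) -> R) -> R {
    MASK_BUFFER.with(|buf| f(&mut buf.borrow_mut()))
}

fn kernel_stub() -> u8 {
    with_mask_buffer(|mask| {
        mask[0] = 0xff; // a kernel would write its partial-column mask here
        mask[0]
    })
}
```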
When threads pay off depends a lot on hardware, so even having a
heuristic is risky, but we'll add one and can improve it later.
On non-x86, this macro can be unused.
These are used all the time for profiling (and only affect development,
so they might as well be enabled).
This is a performance fix, using one Lazy/OnceCell instead of two
separate ones saves a little time - it's just a few ns - which was
visible in the benchmark for (too) small matrices.
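A sketch of the single-cell idea: cache both detection results in one `OnceLock` so the fast path pays one atomic load instead of two. The type and field names here are illustrative stand-ins, not the crate's actual items.

```rust
use std::sync::OnceLock;

// Both results of runtime feature detection live in one cell.
struct Detected {
    kernel: u8,
    masked_kernel: u8,
}

static DETECTED: OnceLock<Detected> = OnceLock::new();

fn detected() -> &'static Detected {
    // Initialized once; every later call is a single cheap load.
    DETECTED.get_or_init(|| Detected { kernel: 1, masked_kernel: 2 })
}
```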
Use repr(align(x)) so we don't have to oversize and manually align the mask
buffer. Also use an UnsafeCell to remove (the very small, few ns)
overhead of borrowing the RefCell. (Its borrowing was pointless anyway,
since we held the raw pointer much longer than RefCell "borrow".)
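A sketch of the buffer change described above: `repr(align)` gives the array its alignment directly, so there is no need to oversize it and align by hand, and `UnsafeCell` hands out a raw pointer without `RefCell`'s borrow bookkeeping. `MaskBuffer` here is a simplified stand-in for the crate's actual type.

```rust
use std::cell::UnsafeCell;

// The attribute guarantees 32-byte alignment of the whole struct.
#[repr(align(32))]
struct MaskBuffer(UnsafeCell<[u8; 32]>);

fn buffer_is_aligned() -> bool {
    let buf = MaskBuffer(UnsafeCell::new([0u8; 32]));
    // UnsafeCell::get returns a raw pointer with no runtime borrow check.
    (buf.0.get() as usize) % 32 == 0
}
```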
bluss and others added 30 commits April 30, 2023 12:26
Use a different pack function for complex microkernels, which puts real and
imaginary parts in separate rows. This enables much better autovectorization
for the fallback kernels.
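A sketch of the split packing: interleaved `(re, im)` pairs are unpacked into one contiguous run of reals followed by one of imaginaries, which gives the fallback kernels straight-line loops over each component. The exact layout and function name are illustrative, not the crate's actual packing format.

```rust
// Unpack interleaved complex data [re0, im0, re1, im1, ...] into
// [re0, re1, ..., im0, im1, ...] so each component is contiguous.
fn pack_split(interleaved: &[f32], out: &mut [f32]) {
    let n = interleaved.len() / 2;
    for i in 0..n {
        out[i] = interleaved[2 * i];         // real parts first
        out[n + i] = interleaved[2 * i + 1]; // imaginary parts after
    }
}
```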
Custom sizes for Fma and Avx2 are a win for performance, and Avx2 does
better than Fma here, so both can be worthwhile.
Because we detect target features to select kernel, and the kernel
can select its own packing functions, we can now specialize the packing
functions per target.

As matrices get larger, the packing performance matters much less, but
for small matrix products it contributes more to the runtime.

The default packing also already has a special case for contiguous
matrices, which happens when in C = A B, A is column major and B is row
major. The specialization in this commit helps the most outside this
special case.
avx2, fma and f32::mul_add together are a success in autovectorization,
while just fma with f32::mul_add is not (!).

For this reason, only call f32::mul_add when we opt in to it.
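A sketch of gating the fused call behind an opt-in, as the commit describes. The feature name `use-mul-add` is illustrative, not the crate's actual flag.

```rust
// Only use the fused multiply-add when explicitly opted in; otherwise
// emit plain ops, which LLVM autovectorizes more freely without fma.
#[inline]
fn fused(a: f32, b: f32, c: f32) -> f32 {
    if cfg!(feature = "use-mul-add") {
        a.mul_add(b, c) // single fused multiply-add, one rounding step
    } else {
        a * b + c // separate multiply and add
    }
}
```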
Remove flags that miri now enables by default.
cgemm was not tested as no-std in CI.
For Bazel compatibility. Fixes #78
In the layout benchmarks, which are used to check packing and kernel
sensitivity to memory layout, test some non-contiguous layouts.
A user showed that in certain configurations on macos, the TLS allocation can
even be 8-byte aligned.
Cargo cross no longer supports this old Rust version; increase the cross
version used.
Completely distrust repr(align()) on macos and always manually ensure basic
alignment.
We requested 32-byte alignment for s390x, but thread-local storage does
not supply it. Lower the requested alignment to 16 in general to avoid
having this problem pop up on other platforms too.
Windows 7 does not respect 16-byte alignment for thread locals on i686
builds. Rust 1.79 changed i686 Windows builds to use native thread-local
support. As a result, using `matrixmultiply` on i686 win7 builds leads
to a UB-check panic (nounwind) when run on Windows 7. See
<rust-lang/rust#138903> for more info.

This change adds `i686-win7-windows-msvc` as an excluded target for the
alignment attribute on `MaskBuffer`.
- Removed redundant `_mm256_permute2f128_ps` instructions for lane swapping.
- Consolidated `bv_lh` usage for upper and lower halves, reducing the number of separate permutes.
- Reordered final output assignments to match the expected layout directly, simplifying downstream processing.
- This change reduces register pressure and improves instruction efficiency without altering the computation logic.