[SIMD] Model weights updating using AVX Instructions #11
Conversation
I ran some performance tests using the perf-testing branch and got the following results:
| Code version | Num weights | Num Threads | Time (s) |
|---|---|---|---|
| Non SIMD | 6000 | 1 | 132.79 |
| SIMD | 6000 | 1 | 134.97 |
| Non SIMD | 15000 | 1 | 843.7 |
| SIMD | 15000 | 1 | 858.7 |
I'll continue investigating why there is no performance improvement.
    if (k == g_accumulator.size() - 1 && iters_sum > 0) {
      const float iters_sum_arr[8] = {iters_sum, iters_sum, iters_sum, iters_sum,
                                      iters_sum, iters_sum, iters_sum, iters_sum};
      iters_sum_slice = _mm256_loadu_ps(iters_sum_arr);
I think you can use `_mm256_broadcast_ss(&iters_sum)` instead of `_mm256_loadu_ps`, so you don't have to create the temporary array first. I tested this and it builds and runs, but you may want to double-check correctness.
    }
    const float n_iter_arr[8] = {n_iter, n_iter, n_iter, n_iter,
                                 n_iter, n_iter, n_iter, n_iter};
    __m256 n_iter_slice = _mm256_loadu_ps(n_iter_arr);
Same here: `_mm256_broadcast_ss(&n_iter)` would avoid the temporary array.
    continue; // Didn't receive this variable from any clients
    }
    // Multiply the weights by local iterations and update g_old_params[v_name].
    for (int i = 0; i < acc_params[v_name].size() / 8 * 8; i += 8) {
The variable `i` has already been used in the outer for loop; reusing it here shadows the outer index and makes the code harder to follow.
    _mm256_storeu_ps(g_old_params[v_name].data() + i, updated_old_params_v_name_slice);
    }
    // Tail case.
    for (int i = acc_params[v_name].size() / 8 * 8; i < acc_params[v_name].size(); i++) {
See comment above
This PR further optimizes the aggregation code by performing arithmetic operations with Intel's 256-bit AVX instructions. It also includes smaller optimizations: removing unnecessary copying of data, a logic change that avoids redundant loads and stores, and converting double types to floats. The code used to test performance is here: https://github.com/octaviansima/secure-aggregation/blob/perf-testing/server/tests/host_test.cpp
Note that the PR is large only because it includes the Intel intrinsics header files required for compilation.