Skip to content

Conversation

@octaviansima
Copy link
Collaborator

@octaviansima octaviansima commented Dec 12, 2020

This PR includes further optimizations of the aggregation code by performing arithmetic operations using Intel's AVX 256-bit instructions. It also includes more minor optimizations regarding non-necessary copying of data, a logic change to avoid redundant loads and stores, and the conversion of double types to floats. Code used to test performance found here: https://github.com/octaviansima/secure-aggregation/blob/perf-testing/server/tests/host_test.cpp

Note that the PR is huge only due to the inclusion of Intel Intrinsics header files required for compilation.

Copy link
Member

@chester-leung chester-leung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some performance tests using the perf-testing branch and got the following results:

Code version Num weights Num Threads Time (s)
Non SIMD 6000 1 132.79
SIMD 6000 1 134.97
Non SIMD 15000 1 843.7
SIMD 15000 1 858.7

I'll continue investigating to see why there is no performance improvement

if (k == g_accumulator.size() - 1 && iters_sum > 0) {
const float iters_sum_arr[8] = {iters_sum, iters_sum, iters_sum, iters_sum,
iters_sum, iters_sum, iters_sum, iters_sum};
iters_sum_slice = _mm256_loadu_ps(iters_sum_arr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use _mm256_broadcast_ss(&iters_sum) instead of _mm256_loadu_ps. That way you don't have to first create an array. I tested this and it builds/runs, but you may want to check correctness.

}
const float n_iter_arr[8] = {n_iter, n_iter, n_iter, n_iter,
n_iter, n_iter, n_iter, n_iter};
__m256 n_iter_slice = _mm256_loadu_ps(n_iter_arr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

continue; // Didn't receive this variable from any clients
}
// Multiple the weights by local iterations and update g_old_params[v_name].
for (int i = 0; i < acc_params[v_name].size() / 8 * 8; i += 8) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable i has already been used in the outer for loop

_mm256_storeu_ps(g_old_params[v_name].data() + i, updated_old_params_v_name_slice);
}
// Tail case.
for (int i = acc_params[v_name].size() / 8 * 8; i < acc_params[v_name].size(); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

@podcastinator
Copy link
Member

Superseded by #12 and #13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants