-
Notifications
You must be signed in to change notification settings - Fork 7
Parallel lto codegen #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen. Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.
tdulcet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
makemake.sh
Outdated
| cat <<EOF >Makefile | ||
| CC ?= gcc | ||
| CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto | ||
| CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I commented this out is that older versions of Clang do not support -flto=auto. We would either need to pass an explicit number of threads or make the =auto conditional if supported by the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think we should bump the minimum compiler version up a bit to also help with the ancient mac -march=native issue. But in this case though we do know the processor count for setting -j from before, so might as well do the core count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heck, since we are generating the Makefile anew anyways, we can do some shell magic to figure out what CC it is and talk appropriately to it. No reliance on properties makefile "if" extensions, just shell. (Yeah I failed to fall asleep oops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I respect Ernst a lot, but the code in its current shape is far from "just C99" already, though not necessarily for "fancy feature" reasons. But either way if we're deviating from C99, we might as well do it neatly.
There's another thing that bumps the minimum compiler version in this very line: -fdiagnostics-color. That argument cannot hold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCC 5 is pretty good already. It has, among other things, the vector extension, if I recall correctly! It was also past the C++11 split though that has nothing to do with us. It's much better than 2.95 is what I'm saying -- I remember all sorts of pain associated with that from elsewhere. Anyways. I will keep that in mind now that there is a known minimum version.
In that case yes, a test-compile might be justifiable, especially since Apple Clang has a separate version scheme from regular clang (when will it stop being within 10 years of so? :)). I've.... stopped using the core 2 mac I used to have some time in ~2017, so I don't remember much about what the latest XCode for it offers. Might need to spin up a VM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with older OSs is difficult. I have this same problem with AutoPrimeNet. Ubuntu does provide Docker containers for each of their releases, which could allow us to at least verify that Mlucas builds, but it would require adding a lot of CI jobs. For example, to test each LTS release for the last 10 years (14.04, 16.04, 18.04, 20.04, 22.04, 24.04) with each architecture (x86, ARM) and with both GCC and Clang would require 24 jobs. I am not aware of any free VMs for macOS or Windows. Ernst did have several older systems he used to manually test with, including a "Core2Duo Mac" (per his comments).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with an older compiler hits about 90% of the issues we want to test with an older OS, save for actual system-level bugs like... I don't know, wrong handling of widened vector registers, but the compiler and assembler are not going to understand the new instruction sets anyways. 10-year-old compiler binaries do largely work on current OS.
That said, I don't expect something that has worked in an old version and a current version to start failing somewhere in between. It could happen, but it would likely not be our fault but someone else's bug.
In the case of mac things: apple rightfully gets a lot of flak for comparmentalizing everything, but in this case it also makes the solution for getting an old compiler simple: grab an old xcode dmg, extract the .app, and use the SDK prefix from within. https://xcodereleases.com/?q=10.9 takes you to a version from 2015 (XCode 6.2).
(When I had my "core2duo mac" I just did ports install gcc and got the new compiler. When clang became established I did the same. But welp, macports does not support macOS 10.7 any more, I think.)
About the job-count issue: I desire a two-tiered CI check: an abbreviated one [without needing approval] for quick checks (perhaps just 5-year-old Linux, current Linux, Windows, Mac) and the full thing. The free GitHub action runner only has this much job concurrency and wait times have become a bit too high. I notice that early-fail is disabled; perhaps it should be on for the abbreviated version (unless it's disabled for log-preserving reasons, then I say we need to make our own early-fail logic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I do not have access to any macOS systems myself, so I rely entirely on the CI for testing.
We could create a two tiered CI check, by making jobs dependent on other jobs. In the successful case, it may make the CI take longer since fewer jobs could run in parallel, but it would allow the CI to fail quicker.
The approval is required by GitHub and I do not believe there is anything we could do about that. However, once your first PR is merged and you are marked as a "contributor" for this repository, it should no longer be required.
The fast fail feature is disabled because it can be useful to see which combinations of OS, architecture and compiler are affected by an issue, to reduce the amount of manual debugging required. Some of the crashes/seg faults can also be intermittent or nondeterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the CI to test with older Linux (Ubuntu 16.04) on both x86 and ARM, with both GCC 5.4 and Clang 3.8. 3 of these 4 jobs currently fail. I tried using Ubuntu 14.04 as well, but it resulted in additional errors due to the old version of Make, GCC, Clang and the hwloc library, so it should be easier to start with Ubuntu 16.04. GCC on x86 with AVX2 fails with error: 'asm' operand has impossible constraints, although other SIMD build modes are successful. GCC on ARM has an unspecified linker error. Clang on x86 is successful. Clang on ARM fails with error: the clang compiler does not support '-mcpu=native'. It may not be possible to support older Clang on ARM when using ASIMD.
Anyway, you may want to rebase this PR, so we could test that your changes do not regress anything.
Co-authored-by: Teal Dulcet <[email protected]>
Now we don't have to maintain the lto regex at all! Also: * balance brackets on case, because someone complained :) * lto flag selection in makemake by compiler ver
makemake.sh
Outdated
| # Optional compile [optimize] args | ||
| C_ARGS=(-fdiagnostics-color -Wall -g) | ||
| C_ARGS_OPT=(-g -O3) | ||
| C_ARGS_DEBUG=(-g -Og '-fsanitize=address,undefined') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the CI was testing builds with both Address Sanitizer/Undefined Behavior Santizer and Thread Sanitizer, but this only supports ASan/UBSan. Also, the -g option is duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be more appropriate to add a positional argument for sanitizer then.
| if [[ $CCVER == *"clang"* ]]; then | ||
| echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen." | ||
| C_ARGS_LTO+=(-flto=thin) | ||
| elif [[ $CCVER == *"gcc"* ]]; then | ||
| echo "GCC detected as C compiler. Enabling multithreaded LTO." | ||
| C_ARGS_LTO+=(-flto="$CPU_THREADS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not believe we would want to use ThinLTO for Clang, as it would provide less performance. Also, using -flto=auto should work with GCC:
| if [[ $CCVER == *"clang"* ]]; then | |
| echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen." | |
| C_ARGS_LTO+=(-flto=thin) | |
| elif [[ $CCVER == *"gcc"* ]]; then | |
| echo "GCC detected as C compiler. Enabling multithreaded LTO." | |
| C_ARGS_LTO+=(-flto="$CPU_THREADS") | |
| if [[ $CCVER == *clang* ]]; then | |
| echo "Clang detected as C compiler. Enabling multithreaded LTO." | |
| C_ARGS_LTO+=(-flto="$CPU_THREADS") | |
| elif [[ $CCVER == *gcc* ]]; then | |
| echo "GCC detected as C compiler. Enabling multithreaded LTO." | |
| C_ARGS_LTO+=(-flto=auto) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple steps of "less performance" at play here. With GCC flto=n/auto we are already at the mercy of the LTO partitioner, though we can perhaps argue that it's fine. I tend to believe that thinLTO is an acceptable compromise as much as partitioned LTO, but of course we could also just add another positional parameter to request full, nonpartitioned, LTO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for why not =n on clang: clang 8.4.0 has =thin but not =n, nor any partitioning options. Yes I'm for nudging the user towards getting a newer clang, but I'm also a bit of a coward…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional information.
Because Mlucas already slower than Prime95, we typically want the maximum possible runtime performance, even if it increases the build times. However, improving the build times can also be useful in some cases. I will let you decide what LTO options we should use, unless one of the other reviewers has an opinion.
| # Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient: | ||
| CC_ARGS+=(-gz) | ||
| # Could make output smaller | ||
| LD_ARGS+=("-Wl,-O1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more about what you are trying to achieve with these new options? If binary size is a concern, we could make the -g option optional, but it is helpful for debugging a crash, without needing to recompile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gz is binary size reduction for free. You zlib/zstd the DWARF and that's it. One fast process, nothing lost at all.
ld optimization is the same.
gz allow you to get debuginfo at lower cost, potentially even making distributing -g3 in the future viable (though optimization would likely make that level of debuginfo useless).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information. For performance, it is recommended to build Mlucas on each system it is going to be run on, instead of cross compiling it, so the script can use the -march=native option. Thus, binary size is usually not much of a concern. However, your PR adds support for multiple debug levels, so maybe the first level could add the debug symbols, the second could disable LTO and the third could disable optimization and enable the selected sanitizer(s).
|
|
||
| MAKE_ARGS+=(-j "$CPU_THREADS") | ||
|
|
||
| CCVER=$($CC --version | head -n1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat brittle, so we may want to do this more similar to how we detect the architecture on Windows and compile/run a simple C program instead, which would allow us to check for the various preprocessor defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good unless we encounter gcc 2.95, which is going to break for some other reason anyways. We are not detecting anything about the system here: we are asking about the compiler specifically, not even the include files, so let's trust the compiler on it.
makemake.sh
Outdated
| cat <<EOF >Makefile | ||
| CC ?= gcc | ||
| CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto | ||
| CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.
Co-authored-by: Teal Dulcet <[email protected]>
07df4ce to
53cf9dc
Compare
| export XCFLAGS=-fsanitize=address,undefined | ||
| bash -e -o pipefail -- makemake.sh use_hwloc debug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it is only used for the makemake.sh script, the export is unnecessary:
| export XCFLAGS=-fsanitize=address,undefined | |
| bash -e -o pipefail -- makemake.sh use_hwloc debug | |
| XCFLAGS=-fsanitize=address,undefined bash -e -o pipefail -- makemake.sh use_hwloc debug |
Same below.
| mapfile -t C_ARGS_ADD <<< "$XCFLAGS" | ||
| C_ARGS+=("${C_ARGS_ADD[@]}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The XCFLAGS variable is only ever going to be a single line, so splitting it by lines and converting it to an array is unnecessary:
| mapfile -t C_ARGS_ADD <<< "$XCFLAGS" | |
| C_ARGS+=("${C_ARGS_ADD[@]}") | |
| if [[ -n $XCFLAGS ]]; then | |
| C_ARGS+=("$XCFLAGS") | |
| fi |
| else | ||
| C_ARGS+=("${C_ARGS_OPT[@]}") | ||
| fi | ||
|
|
||
| if ((DEBUG < 2)); then | ||
| C_ARGS+=("${C_ARGS_LTO[@]}") | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not recommend mixing debug with LTO, especially when using the sanitizers:
| else | |
| C_ARGS+=("${C_ARGS_OPT[@]}") | |
| fi | |
| if ((DEBUG < 2)); then | |
| C_ARGS+=("${C_ARGS_LTO[@]}") | |
| fi | |
| else | |
| C_ARGS+=("${C_ARGS_OPT[@]}" "${C_ARGS_LTO[@]}") | |
| fi |
Using LTO can also disable some compiler warnings.
| ARGS+=($output) | ||
| mapfile -t output_array <<<"$output" | ||
| C_ARGS+=("${output_array[@]}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not equivalent, as mapfile splits by newlines, but this was splitting by all whitespace. Since we are pasting the value in the Makefile anyway, it should be fine to just quote the variable and disable the splitting, if you want to resolve the ShellCheck warning.
| LD_ARGS+=(-lrt) | ||
| fi | ||
| # Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient: | ||
| CC_ARGS+=(-gz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean:
| CC_ARGS+=(-gz) | |
| C_ARGS+=(-gz) |
| \$(CC) \$(CFLAGS) \$(CPPFLAGS) -c -DFACTOR_STANDALONE $WORDS -DTRYQ=4 \$< | ||
| %.o: ../src/%.c | ||
| \$(CC) \$(CFLAGS) \$(CPPFLAGS) -c ${ARGS[@]} \$< | ||
| \$(CC) \$(CFLAGS) \$(CPPFLAGS) -c \$< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly different, as we are now passing the arguments in ARGS (now C_ARGS) everywhere in the Makefile that CFLAGS is used. Notably, this would include additional flags when linking, which may need to be tested.
Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen.
Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.