Skip to content

Conversation

@Artoria2e5
Copy link

Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen.

Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.

Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen.

Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.
Copy link
Member

@tdulcet tdulcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

makemake.sh Outdated
cat <<EOF >Makefile
CC ?= gcc
CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto
CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I commented this out is that older versions of Clang do not support -flto=auto. We would either need to pass an explicit number of threads or make the =auto conditional if supported by the compiler.

Copy link
Author

@Artoria2e5 Artoria2e5 Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think we should bump the minimum compiler version up a bit to also help with the ancient mac -march=native issue. But in this case though we do know the processor count for setting -j from before, so might as well do the core count.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heck, since we are generating the Makefile anew anyways, we can do some shell magic to figure out what CC it is and talk appropriately to it. No reliance on properties makefile "if" extensions, just shell. (Yeah I failed to fall asleep oops

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.

Copy link
Author

@Artoria2e5 Artoria2e5 Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I respect Ernst a lot, but the code in its current shape is far from "just C99" already, though not necessarily for "fancy feature" reasons. But either way if we're deviating from C99, we might as well do it neatly.

There's another thing that bumps the minimum compiler version in this very line: -fdiagnostics-color. That argument cannot hold.

Copy link
Author

@Artoria2e5 Artoria2e5 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCC 5 is pretty good already. It has, among other things, the vector extension, if I recall correctly! It was also past the C++11 split though that has nothing to do with us. It's much better than 2.95 is what I'm saying -- I remember all sorts of pain associated with that from elsewhere. Anyways. I will keep that in mind now that there is a known minimum version.

In that case yes, a test-compile might be justifiable, especially since Apple Clang has a separate version scheme from regular clang (when will it stop being within 10 years of so? :)). I've.... stopped using the core 2 mac I used to have some time in ~2017, so I don't remember much about what the latest XCode for it offers. Might need to spin up a VM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing with older OSs is difficult. I have this same problem with AutoPrimeNet. Ubuntu does provide Docker containers for each of their releases, which could allow us to at least verify that Mlucas builds, but it would require adding a lot of CI jobs. For example, to test each LTS release for the last 10 years (14.04, 16.04, 18.04, 20.04, 22.04, 24.04) with each architecture (x86, ARM) and with both GCC and Clang would require 24 jobs. I am not aware of any free VMs for macOS or Windows. Ernst did have several older systems he used to manually test with, including a "Core2Duo Mac" (per his comments).

Copy link
Author

@Artoria2e5 Artoria2e5 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing with an older compiler hits about 90% of the issues we want to test with an older OS, save for actual system-level bugs like... I don't know, wrong handling of widened vector registers, but the compiler and assembler are not going to understand the new instruction sets anyways. 10-year-old compiler binaries do largely work on current OS.

That said, I don't expect something that has worked in an old version and a current version to start failing somewhere in between. It could happen, but it would likely not be our fault but someone else's bug.

In the case of mac things: apple rightfully gets a lot of flak for comparmentalizing everything, but in this case it also makes the solution for getting an old compiler simple: grab an old xcode dmg, extract the .app, and use the SDK prefix from within. https://xcodereleases.com/?q=10.9 takes you to a version from 2015 (XCode 6.2).

(When I had my "core2duo mac" I just did ports install gcc and got the new compiler. When clang became established I did the same. But welp, macports does not support macOS 10.7 any more, I think.)


About the job-count issue: I desire a two-tiered CI check: an abbreviated one [without needing approval] for quick checks (perhaps just 5-year-old Linux, current Linux, Windows, Mac) and the full thing. The free GitHub action runner only has this much job concurrency and wait times have become a bit too high. I notice that early-fail is disabled; perhaps it should be on for the abbreviated version (unless it's disabled for log-preserving reasons, then I say we need to make our own early-fail logic).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I do not have access to any macOS systems myself, so I rely entirely on the CI for testing.

We could create a two tiered CI check, by making jobs dependent on other jobs. In the successful case, it may make the CI take longer since fewer jobs could run in parallel, but it would allow the CI to fail quicker.

The approval is required by GitHub and I do not believe there is anything we could do about that. However, once your first PR is merged and you are marked as a "contributor" for this repository, it should no longer be required.

The fast fail feature is disabled because it can be useful to see which combinations of OS, architecture and compiler are affected by an issue, to reduce the amount of manual debugging required. Some of the crashes/seg faults can also be intermittent or nondeterministic.

Copy link
Member

@tdulcet tdulcet Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the CI to test with older Linux (Ubuntu 16.04) on both x86 and ARM, with both GCC 5.4 and Clang 3.8. 3 of these 4 jobs currently fail. I tried using Ubuntu 14.04 as well, but it resulted in additional errors due to the old version of Make, GCC, Clang and the hwloc library, so it should be easier to start with Ubuntu 16.04. GCC on x86 with AVX2 fails with error: 'asm' operand has impossible constraints, although other SIMD build modes are successful. GCC on ARM has an unspecified linker error. Clang on x86 is successful. Clang on ARM fails with error: the clang compiler does not support '-mcpu=native'. It may not be possible to support older Clang on ARM when using ASIMD.

Anyway, you may want to rebase this PR, so we could test that your changes do not regress anything.

@tdulcet tdulcet added the enhancement New feature or request label Nov 4, 2025
Artoria2e5 and others added 3 commits November 5, 2025 01:48
Now we don't have to maintain the lto regex at all!

Also:
* balance brackets on case, because someone complained :)
* lto flag selection in makemake by compiler ver
makemake.sh Outdated
# Optional compile [optimize] args
C_ARGS=(-fdiagnostics-color -Wall -g)
C_ARGS_OPT=(-g -O3)
C_ARGS_DEBUG=(-g -Og '-fsanitize=address,undefined')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the CI was testing builds with both Address Sanitizer/Undefined Behavior Santizer and Thread Sanitizer, but this only supports ASan/UBSan. Also, the -g option is duplicated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be more appropriate to add a positional argument for sanitizer then.

Comment on lines +101 to +106
if [[ $CCVER == *"clang"* ]]; then
echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen."
C_ARGS_LTO+=(-flto=thin)
elif [[ $CCVER == *"gcc"* ]]; then
echo "GCC detected as C compiler. Enabling multithreaded LTO."
C_ARGS_LTO+=(-flto="$CPU_THREADS")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe we would want to use ThinLTO for Clang, as it would provide less performance. Also, using -flto=auto should work with GCC:

Suggested change
if [[ $CCVER == *"clang"* ]]; then
echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen."
C_ARGS_LTO+=(-flto=thin)
elif [[ $CCVER == *"gcc"* ]]; then
echo "GCC detected as C compiler. Enabling multithreaded LTO."
C_ARGS_LTO+=(-flto="$CPU_THREADS")
if [[ $CCVER == *clang* ]]; then
echo "Clang detected as C compiler. Enabling multithreaded LTO."
C_ARGS_LTO+=(-flto="$CPU_THREADS")
elif [[ $CCVER == *gcc* ]]; then
echo "GCC detected as C compiler. Enabling multithreaded LTO."
C_ARGS_LTO+=(-flto=auto)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple steps of "less performance" at play here. With GCC flto=n/auto we are already at the mercy of the LTO partitioner, though we can perhaps argue that it's fine. I tend to believe that thinLTO is an acceptable compromise as much as partitioned LTO, but of course we could also just add another positional parameter to request full, nonpartitioned, LTO.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for why not =n on clang: clang 8.4.0 has =thin but not =n, nor any partitioning options. Yes I'm for nudging the user towards getting a newer clang, but I'm also a bit of a coward…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional information.

Because Mlucas already slower than Prime95, we typically want the maximum possible runtime performance, even if it increases the build times. However, improving the build times can also be useful in some cases. I will let you decide what LTO options we should use, unless one of the other reviewers has an opinion.

Comment on lines +118 to +121
# Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient:
CC_ARGS+=(-gz)
# Could make output smaller
LD_ARGS+=("-Wl,-O1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more about what you are trying to achieve with these new options? If binary size is a concern, we could make the -g option optional, but it is helpful for debugging a crash, without needing to recompile.

Copy link
Author

@Artoria2e5 Artoria2e5 Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gz is binary size reduction for free. You zlib/zstd the DWARF and that's it. One fast process, nothing lost at all.

ld optimization is the same.

gz allow you to get debuginfo at lower cost, potentially even making distributing -g3 in the future viable (though optimization would likely make that level of debuginfo useless).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information. For performance, it is recommended to build Mlucas on each system it is going to be run on, instead of cross compiling it, so the script can use the -march=native option. Thus, binary size is usually not much of a concern. However, your PR adds support for multiple debug levels, so maybe the first level could add the debug symbols, the second could disable LTO and the third could disable optimization and enable the selected sanitizer(s).


MAKE_ARGS+=(-j "$CPU_THREADS")

CCVER=$($CC --version | head -n1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat brittle, so we may want to do this more similar to how we detect the architecture on Windows and compile/run a simple C program instead, which would allow us to check for the various preprocessor defines.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good unless we encounter gcc 2.95, which is going to break for some other reason anyways. We are not detecting anything about the system here: we are asking about the compiler specifically, not even the include files, so let's trust the compiler on it.

makemake.sh Outdated
cat <<EOF >Makefile
CC ?= gcc
CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto
CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.

@Artoria2e5 Artoria2e5 requested a review from tdulcet November 6, 2025 01:59
@tdulcet tdulcet force-pushed the main branch 2 times, most recently from 07df4ce to 53cf9dc Compare November 20, 2025 17:32
Comment on lines +154 to +155
export XCFLAGS=-fsanitize=address,undefined
bash -e -o pipefail -- makemake.sh use_hwloc debug
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is only used for the makemake.sh script, the export is unnecessary:

Suggested change
export XCFLAGS=-fsanitize=address,undefined
bash -e -o pipefail -- makemake.sh use_hwloc debug
XCFLAGS=-fsanitize=address,undefined bash -e -o pipefail -- makemake.sh use_hwloc debug

Same below.

Comment on lines +52 to +53
mapfile -t C_ARGS_ADD <<< "$XCFLAGS"
C_ARGS+=("${C_ARGS_ADD[@]}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XCFLAGS variable is only ever going to be a single line, so splitting it by lines and converting it to an array is unnecessary:

Suggested change
mapfile -t C_ARGS_ADD <<< "$XCFLAGS"
C_ARGS+=("${C_ARGS_ADD[@]}")
if [[ -n $XCFLAGS ]]; then
C_ARGS+=("$XCFLAGS")
fi

Comment on lines +184 to +190
else
C_ARGS+=("${C_ARGS_OPT[@]}")
fi

if ((DEBUG < 2)); then
C_ARGS+=("${C_ARGS_LTO[@]}")
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not recommend mixing debug with LTO, especially when using the sanitizers:

Suggested change
else
C_ARGS+=("${C_ARGS_OPT[@]}")
fi
if ((DEBUG < 2)); then
C_ARGS+=("${C_ARGS_LTO[@]}")
fi
else
C_ARGS+=("${C_ARGS_OPT[@]}" "${C_ARGS_LTO[@]}")
fi

Using LTO can also disable some compiler warnings.

Comment on lines -349 to +390
ARGS+=($output)
mapfile -t output_array <<<"$output"
C_ARGS+=("${output_array[@]}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not equivalent, as mapfile splits by newlines, but this was splitting by all whitespace. Since we are pasting the value in the Makefile anyway, it should be fine to just quote the variable and disable the splitting, if you want to resolve the ShellCheck warning.

LD_ARGS+=(-lrt)
fi
# Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient:
CC_ARGS+=(-gz)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean:

Suggested change
CC_ARGS+=(-gz)
C_ARGS+=(-gz)

\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c -DFACTOR_STANDALONE $WORDS -DTRYQ=4 \$<
%.o: ../src/%.c
\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c ${ARGS[@]} \$<
\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c \$<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly different, as we are now passing the arguments in ARGS (now C_ARGS) everywhere in the Makefile that CFLAGS is used. Notably, this would include additional flags when linking, which may need to be tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants