Parallel lto codegen #56

Artoria2e5 · 2025-11-04T11:52:01Z

Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen.

Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.

Lto was turned on ten months ago but with single-thread codegen, which made builds much slower than before. At a little cost to code quality (due to partitioning) we can get back much of the speed by enabling parallel codegen. Yes, my copy of Apple Clang understands what this is, though it does not partition in this case. Clang wants a specific -flto-partitions=N flag, since it interprets -flto=auto as -flto=full, but then GCC wouldn't know what flag that is.

tdulcet

Thanks for the PR.

tdulcet · 2025-11-04T13:45:50Z

makemake.sh

 cat <<EOF >Makefile
 CC ?= gcc
-CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto
+CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto


The reason I commented this out is that older versions of Clang do not support -flto=auto. We would either need to pass an explicit number of threads or make the =auto conditional if supported by the compiler.

Personally I think we should bump the minimum compiler version up a bit to also help with the ancient mac -march=native issue. But in this case though we do know the processor count for setting -j from before, so might as well do the core count.

Heck, since we are generating the Makefile anew anyways, we can do some shell magic to figure out what CC it is and talk appropriately to it. No reliance on properties makefile "if" extensions, just shell. (Yeah I failed to fall asleep oops

Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.

I respect Ernst a lot, but the code in its current shape is far from "just C99" already, though not necessarily for "fancy feature" reasons. But either way if we're deviating from C99, we might as well do it neatly.

There's another thing that bumps the minimum compiler version in this very line: -fdiagnostics-color. That argument cannot hold.

GCC 5 is pretty good already. It has, among other things, the vector extension, if I recall correctly! It was also past the C++11 split though that has nothing to do with us. It's much better than 2.95 is what I'm saying -- I remember all sorts of pain associated with that from elsewhere. Anyways. I will keep that in mind now that there is a known minimum version.

In that case yes, a test-compile might be justifiable, especially since Apple Clang has a separate version scheme from regular clang (when will it stop being within 10 years of so? :)). I've.... stopped using the core 2 mac I used to have some time in ~2017, so I don't remember much about what the latest XCode for it offers. Might need to spin up a VM.

Testing with older OSs is difficult. I have this same problem with AutoPrimeNet. Ubuntu does provide Docker containers for each of their releases, which could allow us to at least verify that Mlucas builds, but it would require adding a lot of CI jobs. For example, to test each LTS release for the last 10 years (14.04, 16.04, 18.04, 20.04, 22.04, 24.04) with each architecture (x86, ARM) and with both GCC and Clang would require 24 jobs. I am not aware of any free VMs for macOS or Windows. Ernst did have several older systems he used to manually test with, including a "Core2Duo Mac" (per his comments).

Testing with an older compiler hits about 90% of the issues we want to test with an older OS, save for actual system-level bugs like... I don't know, wrong handling of widened vector registers, but the compiler and assembler are not going to understand the new instruction sets anyways. 10-year-old compiler binaries do largely work on current OS.

That said, I don't expect something that has worked in an old version and a current version to start failing somewhere in between. It could happen, but it would likely not be our fault but someone else's bug.

In the case of mac things: apple rightfully gets a lot of flak for comparmentalizing everything, but in this case it also makes the solution for getting an old compiler simple: grab an old xcode dmg, extract the .app, and use the SDK prefix from within. https://xcodereleases.com/?q=10.9 takes you to a version from 2015 (XCode 6.2).

(When I had my "core2duo mac" I just did ports install gcc and got the new compiler. When clang became established I did the same. But welp, macports does not support macOS 10.7 any more, I think.)

About the job-count issue: I desire a two-tiered CI check: an abbreviated one [without needing approval] for quick checks (perhaps just 5-year-old Linux, current Linux, Windows, Mac) and the full thing. The free GitHub action runner only has this much job concurrency and wait times have become a bit too high. I notice that early-fail is disabled; perhaps it should be on for the abbreviated version (unless it's disabled for log-preserving reasons, then I say we need to make our own early-fail logic).

Interesting, I do not have access to any macOS systems myself, so I rely entirely on the CI for testing.

We could create a two tiered CI check, by making jobs dependent on other jobs. In the successful case, it may make the CI take longer since fewer jobs could run in parallel, but it would allow the CI to fail quicker.

The approval is required by GitHub and I do not believe there is anything we could do about that. However, once your first PR is merged and you are marked as a "contributor" for this repository, it should no longer be required.

The fast fail feature is disabled because it can be useful to see which combinations of OS, architecture and compiler are affected by an issue, to reduce the amount of manual debugging required. Some of the crashes/seg faults can also be intermittent or nondeterministic.

I updated the CI to test with older Linux (Ubuntu 16.04) on both x86 and ARM, with both GCC 5.4 and Clang 3.8. 3 of these 4 jobs currently fail. I tried using Ubuntu 14.04 as well, but it resulted in additional errors due to the old version of Make, GCC, Clang and the hwloc library, so it should be easier to start with Ubuntu 16.04. GCC on x86 with AVX2 fails with error: 'asm' operand has impossible constraints, although other SIMD build modes are successful. GCC on ARM has an unspecified linker error. Clang on x86 is successful. Clang on ARM fails with error: the clang compiler does not support '-mcpu=native'. It may not be possible to support older Clang on ARM when using ASIMD.

Anyway, you may want to rebase this PR, so we could test that your changes do not regress anything.

.github/workflows/ci.yml

Co-authored-by: Teal Dulcet <[email protected]>

Now we don't have to maintain the lto regex at all! Also: * balance brackets on case, because someone complained :) * lto flag selection in makemake by compiler ver

tdulcet · 2025-11-05T11:37:42Z

makemake.sh

+# Optional compile [optimize] args
+C_ARGS=(-fdiagnostics-color -Wall -g)
+C_ARGS_OPT=(-g -O3)
+C_ARGS_DEBUG=(-g -Og '-fsanitize=address,undefined')


Note that the CI was testing builds with both Address Sanitizer/Undefined Behavior Santizer and Thread Sanitizer, but this only supports ASan/UBSan. Also, the -g option is duplicated.

Might be more appropriate to add a positional argument for sanitizer then.

tdulcet · 2025-11-05T11:43:15Z

makemake.sh

+if [[ $CCVER == *"clang"* ]]; then
+	echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen."
+	C_ARGS_LTO+=(-flto=thin)
+elif [[ $CCVER == *"gcc"* ]]; then
+	echo "GCC detected as C compiler. Enabling multithreaded LTO."
+	C_ARGS_LTO+=(-flto="$CPU_THREADS")


I do not believe we would want to use ThinLTO for Clang, as it would provide less performance. Also, using -flto=auto should work with GCC:

Suggested change

if [[ $CCVER == *"clang"* ]]; then

echo "Clang detected as C compiler. Enabling ThinLTO for multithreaded codegen."

C_ARGS_LTO+=(-flto=thin)

elif [[ $CCVER == *"gcc"* ]]; then

echo "GCC detected as C compiler. Enabling multithreaded LTO."

C_ARGS_LTO+=(-flto="$CPU_THREADS")

if [[ $CCVER == *clang* ]]; then

echo "Clang detected as C compiler. Enabling multithreaded LTO."

C_ARGS_LTO+=(-flto="$CPU_THREADS")

elif [[ $CCVER == *gcc* ]]; then

echo "GCC detected as C compiler. Enabling multithreaded LTO."

C_ARGS_LTO+=(-flto=auto)

There are multiple steps of "less performance" at play here. With GCC flto=n/auto we are already at the mercy of the LTO partitioner, though we can perhaps argue that it's fine. I tend to believe that thinLTO is an acceptable compromise as much as partitioned LTO, but of course we could also just add another positional parameter to request full, nonpartitioned, LTO.

As for why not =n on clang: clang 8.4.0 has =thin but not =n, nor any partitioning options. Yes I'm for nudging the user towards getting a newer clang, but I'm also a bit of a coward…

Thanks for the additional information.

Because Mlucas already slower than Prime95, we typically want the maximum possible runtime performance, even if it increases the build times. However, improving the build times can also be useful in some cases. I will let you decide what LTO options we should use, unless one of the other reviewers has an opinion.

tdulcet · 2025-11-05T12:02:33Z

makemake.sh

+	# Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient:
+	CC_ARGS+=(-gz)
+	# Could make output smaller
+	LD_ARGS+=("-Wl,-O1")


Could you explain more about what you are trying to achieve with these new options? If binary size is a concern, we could make the -g option optional, but it is helpful for debugging a crash, without needing to recompile.

gz is binary size reduction for free. You zlib/zstd the DWARF and that's it. One fast process, nothing lost at all.

ld optimization is the same.

gz allow you to get debuginfo at lower cost, potentially even making distributing -g3 in the future viable (though optimization would likely make that level of debuginfo useless).

Thanks for the information. For performance, it is recommended to build Mlucas on each system it is going to be run on, instead of cross compiling it, so the script can use the -march=native option. Thus, binary size is usually not much of a concern. However, your PR adds support for multiple debug levels, so maybe the first level could add the debug symbols, the second could disable LTO and the third could disable optimization and enable the selected sanitizer(s).

makemake.sh

tdulcet · 2025-11-05T12:17:33Z

makemake.sh


+MAKE_ARGS+=(-j "$CPU_THREADS")
+
+CCVER=$($CC --version | head -n1)


This is somewhat brittle, so we may want to do this more similar to how we detect the architecture on Windows and compile/run a simple C program instead, which would allow us to check for the various preprocessor defines.

This is good unless we encounter gcc 2.95, which is going to break for some other reason anyways. We are not detecting anything about the system here: we are asking about the compiler specifically, not even the include files, so let's trust the compiler on it.

tdulcet · 2025-11-05T12:25:55Z

makemake.sh

 cat <<EOF >Makefile
 CC ?= gcc
-CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto # =auto
+CFLAGS = -fdiagnostics-color -Wall -g -O3 -flto=auto


Bumping the minimum compiler version is not really an option, as I believe Ernst wanted the only requirement to be a C99 compiler on Unix.

Co-authored-by: Teal Dulcet <[email protected]>

tdulcet · 2025-11-21T21:49:24Z

.github/workflows/ci.yml

+        export XCFLAGS=-fsanitize=address,undefined
+        bash -e -o pipefail -- makemake.sh use_hwloc debug


Since it is only used for the makemake.sh script, the export is unnecessary:

Suggested change

export XCFLAGS=-fsanitize=address,undefined

bash -e -o pipefail -- makemake.sh use_hwloc debug

XCFLAGS=-fsanitize=address,undefined bash -e -o pipefail -- makemake.sh use_hwloc debug

Same below.

tdulcet · 2025-11-21T21:50:54Z

makemake.sh

+mapfile -t C_ARGS_ADD <<< "$XCFLAGS"
+C_ARGS+=("${C_ARGS_ADD[@]}")


The XCFLAGS variable is only ever going to be a single line, so splitting it by lines and converting it to an array is unnecessary:

Suggested change

mapfile -t C_ARGS_ADD <<< "$XCFLAGS"

C_ARGS+=("${C_ARGS_ADD[@]}")

if [[ -n $XCFLAGS ]]; then

C_ARGS+=("$XCFLAGS")

fi

tdulcet · 2025-11-21T21:55:45Z

makemake.sh

+else
+	C_ARGS+=("${C_ARGS_OPT[@]}")
+fi
+
+if ((DEBUG < 2)); then
+	C_ARGS+=("${C_ARGS_LTO[@]}")
+fi


I would not recommend mixing debug with LTO, especially when using the sanitizers:

Suggested change

else

C_ARGS+=("${C_ARGS_OPT[@]}")

fi

if ((DEBUG < 2)); then

C_ARGS+=("${C_ARGS_LTO[@]}")

fi

else

C_ARGS+=("${C_ARGS_OPT[@]}" "${C_ARGS_LTO[@]}")

fi

Using LTO can also disable some compiler warnings.

tdulcet · 2025-11-22T13:30:26Z

makemake.sh

-	ARGS+=($output)
+	mapfile -t output_array <<<"$output"
+	C_ARGS+=("${output_array[@]}")


This is not equivalent, as mapfile splits by newlines, but this was splitting by all whitespace. Since we are pasting the value in the Makefile anyway, it should be fine to just quote the variable and disable the splitting, if you want to resolve the ShellCheck warning.

tdulcet · 2025-11-22T13:34:50Z

makemake.sh

 		LD_ARGS+=(-lrt)
 	fi
+	# Only supported on ELF platforms, but is harmless on not-mac since the compiler is not ancient:
+	CC_ARGS+=(-gz)


Did you mean:

Suggested change

CC_ARGS+=(-gz)

C_ARGS+=(-gz)

tdulcet · 2025-11-22T13:39:27Z

makemake.sh

+	\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c -DFACTOR_STANDALONE $WORDS -DTRYQ=4 \$<
 %.o: ../src/%.c
-	\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c ${ARGS[@]} \$<
+	\$(CC) \$(CFLAGS) \$(CPPFLAGS) -c \$<


This is slightly different, as we are now passing the arguments in ARGS (now C_ARGS) everywhere in the Makefile that CFLAGS is used. Notably, this would include additional flags when linking, which may need to be tested.

tdulcet requested changes Nov 4, 2025

View reviewed changes

tdulcet added the enhancement New feature or request label Nov 4, 2025

Artoria2e5 and others added 3 commits November 5, 2025 01:48

Update .github/workflows/ci.yml

5c19759

Co-authored-by: Teal Dulcet <[email protected]>

Boldly move ci debug things into makemake

7bad9a8

Now we don't have to maintain the lto regex at all! Also: * balance brackets on case, because someone complained :) * lto flag selection in makemake by compiler ver

Make shellcheck happy

2c6d713

tdulcet requested changes Nov 5, 2025

View reviewed changes

Artoria2e5 and others added 2 commits November 6, 2025 09:59

ci: pass sanitizer in via env var instead

9651738

Update makemake.sh (will squash, promise)

b5cf1ee

Co-authored-by: Teal Dulcet <[email protected]>

Artoria2e5 requested a review from tdulcet November 6, 2025 01:59

tdulcet force-pushed the main branch 2 times, most recently from 07df4ce to 53cf9dc Compare November 20, 2025 17:32

tdulcet requested changes Nov 22, 2025

View reviewed changes


		MAKE_ARGS+=(-j "$CPU_THREADS")

		CCVER=$($CC --version \| head -n1)

		export XCFLAGS=-fsanitize=address,undefined
		bash -e -o pipefail -- makemake.sh use_hwloc debug

	export XCFLAGS=-fsanitize=address,undefined
	bash -e -o pipefail -- makemake.sh use_hwloc debug
	XCFLAGS=-fsanitize=address,undefined bash -e -o pipefail -- makemake.sh use_hwloc debug

		mapfile -t C_ARGS_ADD <<< "$XCFLAGS"
		C_ARGS+=("${C_ARGS_ADD[@]}")

Parallel lto codegen #56

Are you sure you want to change the base?

Parallel lto codegen #56

Uh oh!

Conversation

Artoria2e5 commented Nov 4, 2025

Uh oh!

tdulcet left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tdulcet Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Artoria2e5 Nov 4, 2025 •

edited

Loading

Artoria2e5 Nov 5, 2025 •

edited

Loading

Artoria2e5 Nov 6, 2025 •

edited

Loading

Artoria2e5 Nov 6, 2025 •

edited

Loading

tdulcet Nov 20, 2025 •

edited

Loading

Artoria2e5 Nov 5, 2025 •

edited

Loading