Merged

NTT #340

116 commits
af7aa90
Minor optimization. Compute weights while waiting for carries to be …
gwoltman Apr 7, 2025
fa33192
Fixed bug where default TAIL_KERNELS in C code did not match default i…
gwoltman Apr 7, 2025
dac928d
Eliminated BCAST=1 and UNROLL_W=3 hack. Replaced with a 3 digit FFT …
gwoltman Apr 9, 2025
b1deb06
Changed FFT spec from a single-digit code (0,1,2,3) to a 3-digit WMH …
gwoltman Jun 8, 2025
47ef597
Added -tune code to time all important -use options!
gwoltman Jun 10, 2025
4de83d0
Added tune code for IN_WG,IN_SIZEX,OUT_WG,OUT_SIZEX
gwoltman Jun 14, 2025
4826949
Solved the nVidia finish causing a CPU busy wait. The queue contains…
gwoltman Jun 23, 2025
757c3f7
At Mihai's suggestion, use a portable method to sleep N microseconds.
gwoltman Jun 23, 2025
d198fd5
Fixed memory allocation bug with MIDDLE=4, PAD=512
gwoltman Jun 30, 2025
fd59830
Changed results.txt to results-N.txt -- the same naming scheme as wor…
gwoltman Jul 5, 2025
0e25bbf
Fixed bug where an AMD-only FFT variant (000) was chosen whenever no …
gwoltman Jul 9, 2025
7afe0a5
Don't output tedious "missing BPW" message when that is expected
gwoltman Jul 11, 2025
09fea79
Fixed compiler warning for Windows compiler (long vs. long long diffe…
gwoltman Jul 11, 2025
4e3c304
Merge branch 'preda:master' into master
gwoltman Aug 5, 2025
60c991d
Corrected comments on UNROLL_W. Simplified biglit1 comment.
gwoltman Aug 5, 2025
957f0cb
Fixed bug in -ctune where it used a BCAST FFT variant which is not su…
gwoltman Aug 5, 2025
05d0d73
Initialize all variables before initializing the thread. Prevents a r…
gwoltman Aug 12, 2025
10938a1
Massive changes to support NTTs and hybrid FFTs. First cut, much cle…
gwoltman Sep 14, 2025
1e63390
Fixed typos in GF31 + GF61 weights shift calculations
gwoltman Sep 14, 2025
38d3ae1
Improved GF31 and GF61 math
gwoltman Sep 17, 2025
9f4ca8c
Minor tweaks and cleanup in carryutil. Some alternate GF61 math impl…
gwoltman Sep 18, 2025
c02646c
Allow GPU to return integer results that are not strictly in the big …
gwoltman Sep 23, 2025
70a0276
Wholesale re-organization of carryutil routines. New feature allows f…
gwoltman Sep 23, 2025
6f65bb7
Undid one of the carryStepSignedSloppy changes.
gwoltman Sep 23, 2025
a71ca88
Added csqTrig and ccubeTrig for NTTs. Did not prove to be helpful on…
gwoltman Sep 24, 2025
f9be4ba
Eliminate chainmul in middle. It is faster on TitanV. Not sure if w…
gwoltman Sep 24, 2025
ecd1832
Fixed assert
gwoltman Sep 25, 2025
52dfa84
Improved GF61 wideMul and csq. Removed unused routines.
gwoltman Sep 26, 2025
4cafe0e
Removed some dead code. Tweaked a few comments.
gwoltman Sep 26, 2025
a0d29d7
Fixed TailMul bug with single_wide kernels. The #define for SINGLE_W…
gwoltman Sep 26, 2025
e64ee76
Delayed negations and mul_t4 from pairSq into onePairSq.
gwoltman Sep 26, 2025
101e160
Use csqq to save one modular GF61 reduction in onePairSquare
gwoltman Sep 26, 2025
f3c334f
Fixed optimization bug in GF31*GF61 NTT where BPW was low (less than 23)
gwoltman Sep 30, 2025
acbd0a2
Fixed typo bug in FP32 code
gwoltman Oct 3, 2025
58e18e4
Fixed next bug in FP32 code.
gwoltman Oct 4, 2025
3ccb6bf
Make all FFTs and NTTs accessible from FFT specs in a single executab…
gwoltman Oct 4, 2025
95778ec
Added help text for -tune options.
gwoltman Oct 4, 2025
a13998e
Fixed bug in conversion from 64-bit CPU words to 32-bit GPU words
gwoltman Oct 4, 2025
f57a0ba
Only tune variant 202 for FP32 (may change that later).
gwoltman Oct 5, 2025
1edd3e1
Have LL tests obey the -log command line argument
gwoltman Oct 5, 2025
167a804
Added -log 1000000 to -tune output
gwoltman Oct 5, 2025
5a5748e
Fixed typo in -workers 2 suggestion
gwoltman Oct 6, 2025
95cbce3
Fixed bug in GF61 csqq that I don't think affected any existing code …
gwoltman Oct 7, 2025
c5d2ce4
Improved FP32+GF61 carry propagation. Minor tweaks to all FP carry p…
gwoltman Oct 7, 2025
15e4737
Remove extraneous comma after ROEavg output. Dropped logging ROE inf…
gwoltman Oct 7, 2025
ee46f6d
Save one AND instruction in CUDA compile of M31 + M61 carry propagation
gwoltman Oct 7, 2025
90ba20d
Don't output -log and -workers tune suggestions if they are already set.
gwoltman Oct 8, 2025
6ea19b0
For completeness, added the hybrid FP32+GF31+GF61 FFT. It may not b…
gwoltman Oct 9, 2025
9019694
Improved M31+M61 carry propagation by using get_balanced_Z61
gwoltman Oct 9, 2025
0eff38c
In M31*M61 carry propagation, construct 128-bit v value differently. …
gwoltman Oct 9, 2025
f29abc4
Finalized decision on using i128 data type to implement i96 math. Cl…
gwoltman Oct 9, 2025
c07ed34
Added quick=n to -tune parameters for config.txt options testing. n …
gwoltman Oct 10, 2025
ea27080
Changed the way sleep time is computed when Queue is full. A shorter…
gwoltman Oct 12, 2025
4cd8a38
Deleted code for slower options. Faster FP32+M31+M61 carry propagation.
gwoltman Oct 13, 2025
9f20b38
More cleanup of less efficient carry propagation options.
gwoltman Oct 13, 2025
28e683a
Added JSON text for type 4 hybrid FFT
gwoltman Oct 15, 2025
da1375a
Fixed typo in last fix
gwoltman Oct 15, 2025
b744c6e
Faster complex mul for GF61. Attempted inline PTX code with disappoi…
gwoltman Oct 15, 2025
bd5e35d
Changed the min BPW for FP32-only FFTs from 3.0 to 1.0.
gwoltman Oct 15, 2025
30cfce8
Fixed crash during setup of middle=1 when width and height are 256. …
gwoltman Oct 15, 2025
ce3c7a2
Added assert at Mihai's suggestion
gwoltman Oct 16, 2025
cc809c1
Prettier parts code at Mihai's suggestion
gwoltman Oct 16, 2025
a93af50
Deleted test code using signed GF61 intermediates
gwoltman Oct 16, 2025
2383424
Fixed bug in ROE calculations for M31+M61 NTT
gwoltman Oct 16, 2025
09e0c23
Pass FFT_TYPE to OpenCL code -- makes carryfused, fftp, carryutil, et…
gwoltman Oct 16, 2025
0d1a624
Compute FRAC_BITS_BIGSTEP in OpenCL code (like GF31 and GF61 NTTs do)…
gwoltman Oct 16, 2025
6204f3d
Faster startup when beginning a new PRP test (long overdue for develo…
gwoltman Oct 16, 2025
5fa3a83
Saved one negation in hybrid carry propagation
gwoltman Oct 17, 2025
25522f7
Filled out the max BPW table. Allow Z=6 for NTTs without any warning.
gwoltman Oct 17, 2025
fd0a895
Eliminate fixed definition of SLOPPY_MAXBPW. C code now passes in th…
gwoltman Oct 18, 2025
bc09cc8
Split a long -tune output onto two lines.
gwoltman Oct 18, 2025
3ff265c
Return correct type from mad32
gwoltman Oct 19, 2025
91554a6
Fixed bug where MUL3 could overflow an i32 in the (unsupported) M31-o…
gwoltman Oct 20, 2025
a9b4904
Eliminated i96_mul. Two adds should be at least as fast as a mul by 3.
gwoltman Oct 20, 2025
0a108d9
Made a set of routines to support i128 and u128. Needed because Inte…
gwoltman Oct 22, 2025
95ea128
Detect nVidia GPUs. Set HAS_PTX.
gwoltman Oct 22, 2025
bcd3c35
Added some changes for an MSYS build
gwoltman Oct 22, 2025
da46795
Another fix to get prpll linking under MSYS2
gwoltman Oct 22, 2025
5057c13
Wrote a PTX version of mad64 that is faster (on TitanV) than both old…
gwoltman Oct 22, 2025
4566c26
Implemented i96 as three 32-bit quantities with PTX asm.
gwoltman Oct 24, 2025
ec16578
Ignore first call to setSquareTime. First timings are inaccurate due…
gwoltman Oct 25, 2025
d83506d
Change maxBpw calculation as only some FFT types support both 32 and …
gwoltman Oct 26, 2025
b569083
Allow leadIn/leadOut to be used across a modMul call. A very minor o…
gwoltman Oct 26, 2025
fedb6e4
Further fixes to ignoring first setSquareTime. Previous fix only wor…
gwoltman Oct 26, 2025
aa8af68
Corrected comments on max values during a GF61 cmul
gwoltman Oct 26, 2025
c006247
More comments corrections
gwoltman Oct 26, 2025
6fc7e93
Use user specified quick value and exponent to adjust number of itera…
gwoltman Oct 27, 2025
81b3f0c
Minor tweak so that M31*M61 NTTs time a few more iterations for wavef…
gwoltman Oct 27, 2025
1a591da
Changed the wording in some -tune messages
gwoltman Oct 30, 2025
365f492
Reduce default exponent for config tuning FP32. MaxExp depends on va…
gwoltman Nov 1, 2025
df87784
Fixed AMD asm problem handling BPW between 31 and 32
gwoltman Nov 2, 2025
48c9208
Changed from executing smallest exponent in worktodo.txt to requiring…
gwoltman Nov 6, 2025
7c3571e
Improved GF31 reduction mod M31
gwoltman Nov 7, 2025
6977bba
More mad32
gwoltman Nov 7, 2025
f25eace
Improved cache locality for M31+M61 NTTs. Helpful on machines with a…
gwoltman Nov 7, 2025
197928e
Made the new modM31 macro a -use option. The new code is 1% faster o…
gwoltman Nov 8, 2025
b6b6452
Fixed proof generation/verification bug introduced with M31/M61 cache…
gwoltman Nov 8, 2025
33215a9
Use mad32 instructions in csqTrig and ccubeTrig. Should help TABMUL_…
gwoltman Nov 8, 2025
c08ad6a
Added third nontemporal option. I don't know if it will be useful.
gwoltman Nov 9, 2025
6f4b7aa
Fixed compile-time bug in non-PTX mad64 routine
gwoltman Nov 9, 2025
9c6ff47
Very minor optimization. Add rarely used FFTW kernel to cache_group.
gwoltman Nov 9, 2025
9076198
Check for needed builtins in variant 0 FFTs
gwoltman Nov 10, 2025
a7bb209
Wrote asm routines for prefetching on nVidia. They did not help. Ma…
gwoltman Nov 15, 2025
5e6ad1b
Fixed lint issues in MINGW64 where u64 is unsigned long long rather t…
gwoltman Nov 15, 2025
fa249e1
More MINGW-64 changes where a long is 32 bits vs. everywhere else a l…
gwoltman Nov 16, 2025
3554909
Changed all usages of HAS_PTX to test for needed level of CUDA suppor…
gwoltman Nov 16, 2025
d2d52aa
Fixed memory leak enqueuing marker
gwoltman Nov 16, 2025
76e4276
Fix for compile error on Windows AMD GPUs trying to access amdgcn bui…
gwoltman Nov 18, 2025
f3326d3
Close .cert file before trying to delete the file. Windows file dele…
gwoltman Nov 18, 2025
553ee7c
Fixed compile problem on FP32+GF61 hybrid FFTs.
gwoltman Nov 19, 2025
c923698
Fixed two more hybrid FFT OVERLOAD errors
gwoltman Nov 19, 2025
6bc8b43
-tune spews warning messages on Windows with an AMD GPU because clang…
gwoltman Nov 19, 2025
c363095
Added INPLACE=1 option. Documentation says INPLACE=2 will choose a g…
gwoltman Dec 10, 2025
46386d0
Fixed tuning INPLACE bug
gwoltman Dec 10, 2025
4db49dc
Don't allow testing non-prime exponents. They sometimes raise excess…
gwoltman Dec 12, 2025
09ed544
Added no fp32 tune option (a Windows user found OpenCL compiler choki…
gwoltman Dec 12, 2025
8 changes: 6 additions & 2 deletions Makefile
@@ -19,8 +19,12 @@ else
  CXX = g++
  endif

- COMMON_FLAGS = -Wall -std=c++20
- # -static-libstdc++ -static-libgcc
+ ifneq ($(findstring MINGW, $(HOST_OS)), MINGW)
+ COMMON_FLAGS = -Wall -std=c++20 -static-libstdc++ -static-libgcc
+ else
+ # For mingw-64 use this:
+ COMMON_FLAGS = -Wall -std=c++20 -static-libstdc++ -static-libgcc -static
+ endif
  # -fext-numeric-literals

  ifeq ($(HOST_OS), Darwin)
75 changes: 37 additions & 38 deletions src/Args.cpp
@@ -156,6 +156,7 @@ named "config.txt" in the prpll run directory.
  -prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
  -ll <exponent> : run a single LL test and exit, ignoring worktodo.txt
  -verify <file> : verify PRP-proof contained in <file>
+ -smallest : work on smallest exponent in worktodo.txt rather than the first exponent in worktodo.txt
  -proof <power> : generate proof of power <power> (default: optimal depending on exponent).
  A lower power reduces disk space requirements but increases the verification cost.
  A higher power increases disk usage a lot.
@@ -185,6 +186,8 @@ named "config.txt" in the prpll run directory.
  2 = calculate from scratch, no memory read
  1 = calculate using one complex multiply from cached memory and uncached memory
  0 = read trig values from memory
+ -use INPLACE=n : Perform transforms in-place. Great if the reduced memory usage fits in the GPU's L2 cache.
+ 0 = not in-place, 1 = nVidia friendly access pattern, 2 = AMD friendly access pattern.
  -use PAD=<val> : insert pad bytes to possibly improve memory access patterns. Val is number bytes to pad.
  -use MIDDLE_IN_LDS_TRANSPOSE=0|1 : Transpose values in local memory before writing to global memory
  -use MIDDLE_OUT_LDS_TRANSPOSE=0|1 : Transpose values in local memory before writing to global memory
@@ -194,23 +197,14 @@ named "config.txt" in the prpll run directory.

  -use DEBUG : enable asserts in OpenCL kernels (slow, developers)

- -tune : measures the speed of the FFTs specified in -fft <spec> to find the best FFT for each exponent.
-
- -ctune <configs> : finds the best configuration for each FFT specified in -fft <spec>.
- Prints the results in a form that can be incorporated in config.txt
- -fft 6.5M -ctune "OUT_SIZEX=32,8;OUT_WG=64,128,256"
-
- It is possible to specify -ctune multiple times on the same command in order to define multiple
- sets of parameters to be combined, e.g.:
- -ctune "IN_WG=256,128,64" -ctune "OUT_WG=256,64;OUT_SIZEX=32,16,8"
- which would try only 8 combinations among those two sets.
-
- The tunable parameters (with the default value emphasized) are:
- IN_WG, OUT_WG: 64, 128, *256*
- IN_SIZEX, OUT_SIZEX: 4, 8, 16, *32*
- UNROLL_W: *0*, 1
- UNROLL_H: 0, 1
-
+ -tune <options> : Looks for best settings to include in config.txt. Times many FFTs to find fastest one to test exponents -- written to tune.txt.
+ An -fft <spec> can be given on the command line to limit which FFTs are timed.
+ Options are not required. If present, the options are a comma separated list from below.
+ noconfig - Skip timings to find best config.txt settings
+ fp64 - Tune for settings that affect FP64 FFTs. Time FP64 FFTs for tune.txt.
+ ntt - Tune for settings that affect integer NTTs. Time integer NTTs for tune.txt.
+ minexp=<val> - Time FFTs to find the best one for exponents greater than <val>.
+ maxexp=<val> - Time FFTs to find the best one for exponents less than <val>.
  -device <N> : select the GPU at position N in the list of devices
  -uid <UID> : select the GPU with the given UID (on ROCm/AMDGPU, Linux)
  -pci <BDF> : select the GPU with the given PCI BDF, e.g. "0c:00.0"
@@ -236,31 +230,34 @@ Device selection : use one of -uid <UID>, -pci <BDF>, -device <N>, see the list
      );

    }
-   printf("\nFFT Configurations (specify with -fft <width>:<middle>:<height> from the set below):\n"
+   printf("\nFFT Configurations (specify with -fft <type>:<width>:<middle>:<height> from the set below):\n"
           " Size MaxExp BPW FFT\n");

    vector<FFTShape> configs = FFTShape::allShapes();
    configs.push_back(configs.front()); // dummy guard for the loop below.
-   string variants;
    u32 activeSize = 0;
-   double maxBpw = 0;
-   for (auto c : configs) {
-     if (c.size() != activeSize) {
-       if (!variants.empty()) {
-         printf("%5s %7.2fM %.2f %s\n",
-                numberK(activeSize).c_str(),
-                // activeSize * FFTShape::MIN_BPW / 1'000'000,
-                activeSize * maxBpw / 1'000'000.0,
-                maxBpw,
-                variants.c_str());
-         variants.clear();
+   float maxBpw = 0;
+   string variants;
+   for (enum FFT_TYPES type : {FFT64, FFT3161, FFT3261, FFT61}) {
+     for (auto c : configs) {
+       if (c.fft_type != type) continue;
+       if (c.size() != activeSize) {
+         if (!variants.empty()) {
+           printf("%5s %7.2fM %.2f %s\n",
+                  numberK(activeSize).c_str(),
+                  // activeSize * FFTShape::MIN_BPW / 1'000'000,
+                  activeSize * maxBpw / 1'000'000.0,
+                  maxBpw,
+                  variants.c_str());
+           variants.clear();
+         }
+         activeSize = c.size();
+         maxBpw = 0;
        }
-       activeSize = c.size();
-       maxBpw = 0;
+       maxBpw = max(maxBpw, c.maxBpw());
+       if (!variants.empty()) { variants.push_back(','); }
+       variants += c.spec();
      }
-     maxBpw = max(maxBpw, c.maxBpw());
-     if (!variants.empty()) { variants.push_back(','); }
-     variants += c.spec();
    }
  }
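
For readers following the listing loop in this hunk, here is a minimal standalone sketch of the same idea: walk the shapes once per FFT type and emit one row per (type, size) pair with the largest BPW and a comma-separated variant list. `Shape`, `Group` and `groupBySize` are hypothetical stand-ins invented for this sketch, not prpll names. The sketch also swaps the dummy-guard element used in the diff for an explicit flush at the end of each type pass, which is a different technique from the one in the PR.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for FFTShape: only the fields this sketch needs.
struct Shape {
  int type;          // FFT type tag (illustrative integer, not the real enum)
  unsigned size;     // FFT size
  double maxBpw;     // maximum bits-per-word for this shape
  std::string spec;  // printable spec
};

// One output row: all variants of one (type, size) pair.
struct Group {
  int type;
  unsigned size;
  double maxBpw;
  std::string variants;
};

// Groups shapes by type (in the order given by `types`), then by size.
// Assumes shapes of equal size are adjacent within a type, as in a sorted list.
std::vector<Group> groupBySize(const std::vector<Shape>& shapes, const std::vector<int>& types) {
  std::vector<Group> rows;
  for (int type : types) {
    Group cur{type, 0, 0.0, ""};
    // Emit the pending row, if any, and reset the accumulators.
    auto flush = [&]() {
      if (!cur.variants.empty()) { rows.push_back(cur); }
      cur.variants.clear();
      cur.maxBpw = 0;
    };
    for (const Shape& c : shapes) {
      if (c.type != type) continue;
      if (c.size != cur.size) {
        flush();
        cur.size = c.size;
      }
      cur.maxBpw = std::max(cur.maxBpw, c.maxBpw);
      if (!cur.variants.empty()) { cur.variants.push_back(','); }
      cur.variants += c.spec;
    }
    flush();  // explicit flush replaces the dummy guard element
  }
  return rows;
}
```

Flushing explicitly keeps each type pass self-contained, at the cost of one extra call per type.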

@@ -295,9 +292,10 @@ void Args::parse(const string& line) {
        log("-info expects an FFT spec, e.g. -info 1K:13:256\n");
        throw "-info <fft>";
      }
-     log(" FFT | BPW | Max exp (M)\n");
+     log(" FFT | BPW | Max exp (M)\n");
      for (const FFTShape& shape : FFTShape::multiSpec(s)) {
        for (u32 variant = 0; variant <= LAST_VARIANT; variant = next_variant (variant)) {
+         if (variant != LAST_VARIANT && shape.fft_type != FFT64) continue;
          FFTConfig fft{shape, variant, CARRY_AUTO};
          log("%12s | %.2f | %5.1f\n", fft.spec().c_str(), fft.maxBpw(), fft.maxExp() / 1'000'000.0);
        }
@@ -310,8 +308,8 @@ void Args::parse(const string& line) {
      assert(s.empty());
      logROE = true;
    } else if (key == "-tune") {
-     assert(s.empty());
      doTune = true;
+     if (!s.empty()) { tune = s; }
    } else if (key == "-ctune") {
      doCtune = true;
      if (!s.empty()) { ctune.push_back(s); }
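
The hunk above only stores the raw `-tune` argument in `tune`; how it is decoded later is not visible in this part of the diff. As a rough illustration of the comma-separated format described in the new help text, here is a minimal sketch. `TuneOptions` and `parseTune` are hypothetical names made up for this example, and only the five options listed in the help text shown above are handled.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical holder for parsed -tune options; field names are illustrative.
struct TuneOptions {
  bool noconfig = false;
  bool fp64 = false;
  bool ntt = false;
  std::uint64_t minExp = 0;  // 0 means "not given"
  std::uint64_t maxExp = 0;  // 0 means "not given"
};

// Parses a string such as "noconfig,ntt,minexp=100000000".
// Unknown tokens throw std::invalid_argument (an assumption of this sketch).
TuneOptions parseTune(const std::string& s) {
  TuneOptions opts;
  std::istringstream in{s};
  std::string tok;
  while (std::getline(in, tok, ',')) {
    if (tok.empty()) continue;
    const auto eq = tok.find('=');
    const std::string key = tok.substr(0, eq);
    const std::string val = (eq == std::string::npos) ? "" : tok.substr(eq + 1);
    if (key == "noconfig") { opts.noconfig = true; }
    else if (key == "fp64") { opts.fp64 = true; }
    else if (key == "ntt") { opts.ntt = true; }
    else if (key == "minexp") { opts.minExp = std::stoull(val); }
    else if (key == "maxexp") { opts.maxExp = std::stoull(val); }
    else { throw std::invalid_argument("unknown -tune option: " + tok); }
  }
  return opts;
}
```

An empty string yields the defaults, which matches the help text's statement that options are not required.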
@@ -372,6 +370,7 @@ void Args::parse(const string& line) {
  else if (key == "-iters") { iters = stoi(s); assert(iters && (iters % 10000 == 0)); }
  else if (key == "-prp" || key == "-PRP") { prpExp = stoll(s); }
  else if (key == "-ll" || key == "-LL") { llExp = stoll(s); }
+ else if (key == "-smallest") { smallest = true; }
  else if (key == "-fft") { fftSpec = s; }
  else if (key == "-dump") { dump = s; }
  else if (key == "-user") { user = s; }
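
The new `-smallest` flag is described in the help text as working on the smallest exponent in worktodo.txt rather than the first one. The code that consumes the flag is not part of this excerpt, so the following is only a sketch of that selection rule. `pickTask` is a hypothetical helper, and representing worktodo.txt entries as a plain vector of exponents is an assumption of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Returns the index of the worktodo entry to run next.
// With smallest == false the first entry wins; otherwise the entry with the
// smallest exponent wins (the earliest one on ties).
// The caller is expected to check that the list is not empty.
std::size_t pickTask(const std::vector<std::uint64_t>& exponents, bool smallest) {
  if (!smallest) { return 0; }
  const auto it = std::min_element(exponents.begin(), exponents.end());
  return static_cast<std::size_t>(std::distance(exponents.begin(), it));
}
```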
4 changes: 3 additions & 1 deletion src/Args.h
@@ -43,6 +43,7 @@ class Args {
  string uid;
  string verifyPath;

+ string tune;
  vector<string> ctune;

  bool doCtune{};
@@ -53,14 +54,15 @@ class Args {

  std::map<std::string, std::string> flags;
  std::map<std::string, vector<KeyVal>> perFftConfig;

  int device = 0;

  bool safeMath = true;
  bool clean = true;
  bool verbose = false;
  bool useCache = false;
  bool profile = false;
+ bool smallest = false;

  fs::path masterDir;
  fs::path proofResultDir = "proof";
5 changes: 3 additions & 2 deletions src/Background.h
@@ -16,10 +16,10 @@
  class Background {
    unsigned maxSize;
    std::deque<std::function<void()> > tasks;
-   std::jthread thread;
    std::mutex mut;
    std::condition_variable cond;
-   bool stopRequested{};
+   bool stopRequested;
+   std::jthread thread;

    void run() {
      std::function<void()> task;
@@ -59,6 +59,7 @@ class Background {
  public:
    Background(unsigned size = 2) :
      maxSize{size},
+     stopRequested(false),
      thread{&Background::run, this} {
    }

2 changes: 1 addition & 1 deletion src/Buffer.h
@@ -26,7 +26,7 @@ class Buffer {

    Queue* queue;
    TimeInfo *tInfo;

    Buffer(cl_context context, TimeInfo *tInfo, Queue* queue, size_t size, unsigned flags, const T* ptr = nullptr)
      : ptr{size == 0 ? NULL : makeBuf_(context, flags, size * sizeof(T), ptr)}
      , size{size}