@qwang98 qwang98 commented Nov 17, 2025

WIP. Currently just a prototype for the ALU chip on the OVM side. If we go down this route, a similar modification will be needed for all chips.

Passes guest_prove_simple NON-DETERMINISTICALLY in some runs, likely due to a bug that's more on the Powdr end than on the OVM end: powdr-labs/powdr#3458. The Keccak test fails entirely, likely due to the same bug.

The current draft represents the minimum change to the chip implementations, pushing API changes as much as possible into RowSliceNew. The only changes to the chips are:

  • Create the first APC RowSlice according to additional injected device inputs from CPU.
  • Conditionally skip modifications to the dummy periphery (e.g. add_count) depending on whether row_slice.is_apc is true.

Otherwise, most APIs are simply replaced with their _new version, without any changes to the arguments.

TODOs:

  • Pass Fibo test with identity matrix for subs for APC dummy trace gen and non-APC trace gen
  • Pass Fibo test with direct write to APC trace
  • Pass Keccak test
  • Benchmark ALU chip and decide whether to continue

There's nothing static added to this yet, as currently everything (including whether we take the APC witgen path vs. the original instruction path for the ALU chip in the current call to generate_proving_ctx) is determined by runtime values.

Comment on lines 75 to 82
#define COL_WRITE_VALUE_APC(APC_ROW, STRUCT, FIELD, VALUE, SUB, OFFSET) \
    do { \
        if ((SUB)[COL_INDEX(STRUCT, FIELD) + (OFFSET)] != UINT32_MAX) { \
            (APC_ROW).write( \
                COL_INDEX(STRUCT, FIELD) + (OFFSET), \
                VALUE \
            ); \
        } \
    } while (0)
Author:

Only write if not optimized away based on SUB.

Comment on lines 40 to 46
if (!apc_row.is_null()) {
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, b_pointer, Fp::fromRaw(rec.b), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, c_pointer, Fp::fromRaw(rec.c), sub, offset);

// TODO: adapt the rest similar to above
Author:

Not complete but just showing how we use COL_WRITE_VALUE_APC.


impl Chip<DenseRecordArena, GpuBackend> for Rv32BaseAluChipGpu {
fn generate_proving_ctx(&self, arena: DenseRecordArena) -> AirProvingContext<GpuBackend> {
fn generate_proving_ctx(&self, arena: DenseRecordArena, d_apc_trace: DeviceMatrix<F>, subs: Option<Vec<Vec<u32>>>, apc_row_index: Option<Vec<u32>>) -> AirProvingContext<GpuBackend> {
Author:

d_apc_trace should be supplied by PowdrTraceGenerator::generate_witness as the Some variant, and set to None for regular trace gen when no APC is involved. Similarly for subs and apc_row_index.

Author:

subs has the same length as the number of dummy columns: each entry is indexed by dummy column index and holds the corresponding APC column index. The value is hardcoded to UINT32_MAX if the dummy column is optimized away in the APC.

This vector design is needed because the GPU has no native map encoding; as long as the vector is not too sparse, we don't waste too much space.

Comment on lines 43 to 44
RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
Author:

Here we need the sub for a specific dummy row, because each row might be optimized differently within an APC.

The apc_row should also be specific to the dummy row, because dummy trace gen can involve more than one APC row.

COL_WRITE_VALUE(row, AluNativeAdapterCols, reads_aux[i].is_immediate, Fp::one());
if (i == 0) {
COL_WRITE_VALUE(row, AluNativeAdapterCols, e_as, Fp(AS_IMMEDIATE));
__device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) {
Reviewer:

What is stopping us from passing row and apc_row as the same argument? That would remove the conditional in the body. Could COL_WRITE_VALUE_APC and COL_WRITE_VALUE be unified as well?

Author:

Yeah that's a good point re unifying row and apc_row.

I think the main point of keeping COL_WRITE_VALUE_APC separate is speed: one goes through subs and the other doesn't, but that also comes at the cost of a larger diff.

Comment on lines 85 to 92
COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data);
mem_helper.fill(
row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)),
rec.write_aux.prev_timestamp,
rec.from_timestamp + 2
);
}

COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data);
mem_helper.fill(
row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)),
rec.write_aux.prev_timestamp,
rec.from_timestamp + 2
);

Reviewer:

This also needs to go through the subs

Author:

Yes, exactly. I haven't modified this yet, but it will follow a similar style to COL_WRITE_VALUE_APC.

__device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) {
if (!apc_row.is_null()) {
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
Reviewer:

Suggested change
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);

Reviewer:

etc

Comment on lines 113 to 122
__device__ __forceinline__ size_t number_of_gaps_in(const uint32_t *sub, size_t len) {
size_t gaps = 0;
#pragma unroll
for (size_t i = 0; i < len; ++i) {
if (sub[i] == UINT32_MAX) {
++gaps;
}
}
return gaps;
}
Author:

Should return 0 for a sub vector without UINT32_MAX.
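A host-side analogue (names hypothetical) that pins down this contract:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side analogue of number_of_gaps_in, for illustration only: counts the
// UINT32_MAX sentinels (optimized-away columns) in a sub slice. A sub slice
// with no sentinels has zero gaps.
size_t number_of_gaps(const uint32_t *sub, size_t len) {
    size_t gaps = 0;
    for (size_t i = 0; i < len; ++i) {
        if (sub[i] == UINT32_MAX) {
            ++gaps;
        }
    }
    return gaps;
}
```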

Comment on lines 35 to 45
uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
RowSlice row(d_trace + idx, height);
if (idx < d_records.len()) {
auto const &rec = d_records[idx];
// RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
// auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
uint32_t *sub = subs;
Author:

Currently subs is just an identity vector, but it will require a lot more precomputation to make it work with APCs.

Comment on lines 48 to 55
core.fill_trace_row(row.slice_from(COL_INDEX(Rv32BaseAluCols, core)), rec.core);
core.fill_trace_row_new(row.slice_from(COL_INDEX(Rv32BaseAluCols, core) - number_of_gaps_in(sub, sizeof(Rv32BaseAluCols<uint8_t>))), rec.core, sub);
Author:

Note that this is a very common style of patch: whenever we slice an (APC) row, we now also "retract" the slice by the number of gaps introduced by the optimization. For a non-APC row we retract by nothing, so the call remains backward compatible with non-APC uses of the function.
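The retraction arithmetic can be sketched as follows (retracted_offset is a hypothetical name; the real code inlines this at each slice site):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the "retract by gaps" pattern: the start of a slice into the
// packed (post-optimization) APC row is the original column index minus the
// number of optimized-away columns that precede it in the sub slice.
size_t retracted_offset(size_t col_index, const uint32_t *sub) {
    size_t gaps = 0;
    for (size_t i = 0; i < col_index; ++i) {
        if (sub[i] == UINT32_MAX) {
            ++gaps; // this column was removed by the APC optimization
        }
    }
    return col_index - gaps;
}
```

For a non-APC row the sub slice has no sentinels, so the offset is returned unchanged and the call site behaves exactly as before.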

Reviewer:

Yes maybe there's a more natural way to encode it, but if this works we can improve later

Reviewer:

Ideally this would be moved into the slice_from function though, and we need to store the offset in the RowSlice.

Author:

Done.

@qwang98 qwang98 requested a review from Schaeff November 17, 2025 21:25
@qwang98 qwang98 changed the title [OVM GPU] Skip dummy witgen for APC [OVM GPU] direct to APC trace gen Nov 17, 2025
let d_records = records.to_device().unwrap();
let d_trace = DeviceMatrix::<F>::with_capacity(trace_height, trace_width);

// TODO: use actual sub not hardcoded
Reviewer:

I guess for software it would still use the identity

Author:

Yes.

Comment on lines +9 to +13
struct RowSliceNew {
Fp *ptr;
size_t stride;
size_t optimized_offset;
size_t dummy_offset;
Author (Nov 18, 2025):

Added this new RowSliceNew struct that stores the offsets.

optimized_offset is the smaller one: basically the cumulative dummy_offset minus the gaps.

dummy_offset is the larger one: basically the cumulative COL_INDEX of the original columns.

Comment on lines 46 to 54
__device__ __forceinline__ void write_array_new(size_t column_index, size_t length, const T *values, const uint32_t *sub)
const {
#pragma unroll
for (size_t i = 0; i < length; i++) {
if (sub[i] != UINT32_MAX) {
ptr[(column_index + i) * stride] = values[i];
}
}
}
Author:

Might need to update this as well, but I currently don't see a bug.

Comment on lines 159 to 167
/// Conditionally write a single value into `FIELD` based on APC sub-columns.
#define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \
do { \
const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \
const auto _apc_idx = (SUB)[_col_idx + (ROW).dummy_offset]; \
if (_apc_idx != UINT32_MAX) { \
(ROW).write(_apc_idx - (ROW).optimized_offset, VALUE); \
} \
} while (0)
Author:

This is the other key update that computes relative sub from absolute subs:

  1. Convert relative _col_idx to absolute sub index by adding ROW.dummy_offset, because the index is based on original dummy offset.
  2. _apc_idx is absolute APC index.
  3. Because now we also row slice APC trace by post-optimization offset, we need relative APC index when writing, and that's why we subtract ROW.optimized_offset from absolute APC index to obtain the relative APC index.

This should be fully backward compatible: in the non-APC use case, optimized_offset equals dummy_offset (there are no gaps), so adding and then subtracting the same offset is a no-op, equivalent to using the identity matrix.
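The same arithmetic, reduced to a host-side sketch (apc_write_index and its out-parameter shape are hypothetical; the real macro writes into the row directly):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the COL_WRITE_VALUE_NEW index arithmetic: convert a struct-relative
// column index into a write position relative to the current APC row slice.
// Returns false when the column was optimized away (UINT32_MAX sentinel).
bool apc_write_index(size_t col_idx, size_t dummy_offset, size_t optimized_offset,
                     const uint32_t *sub, size_t &out) {
    uint32_t apc_idx = sub[col_idx + dummy_offset]; // absolute APC column index
    if (apc_idx == UINT32_MAX) {
        return false; // optimized away: skip the write
    }
    out = apc_idx - optimized_offset; // make it relative to the current slice
    return true;
}
```

With an identity sub and equal offsets, the computed position equals col_idx, which is the claimed backward compatibility.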

Comment on lines 36 to 41
RowSlice row(d_trace + idx, height);
RowSliceNew row(d_trace + idx / calls_per_apc_row, height, 0, 0); // we need to slice to the correct APC row, but if non-APC it's dividing by 1 and therefore the same idx
Author:

Initialize row with zero for both optimized_offset and dummy_offset.

In the case of non-APC row, calls_per_apc_row is hardcoded to 1, so it's basically the same as d_trace + idx.

In the case of an APC row, we integer-divide by calls_per_apc_row, so we access the correct APC slice.

auto const &rec = d_records[idx];
// RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
// auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
uint32_t *sub = &subs[(idx % calls_per_apc_row) * width]; // dummy width
Author:

We need the correct sub for the instruction, which depends on which record it is in.

All subs have the same width across instructions, which is convenient.
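A sketch of the combined record-to-row mapping from these two snippets (RecordMapping and map_record are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: each APC row absorbs calls_per_apc_row dummy records, so the APC row
// for record idx is idx / calls_per_apc_row, and the per-instruction sub slice
// starts at (idx % calls_per_apc_row) * width within the flat subs buffer.
struct RecordMapping {
    size_t apc_row;
    size_t sub_offset;
};

RecordMapping map_record(size_t idx, size_t calls_per_apc_row, size_t width) {
    return {idx / calls_per_apc_row, (idx % calls_per_apc_row) * width};
}
```

With calls_per_apc_row == 1 (the non-APC case), this degenerates to apc_row == idx and sub_offset == 0, matching the hardcoded identity path.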

Comment on lines 71 to 73
__device__ __forceinline__ RowSliceNew slice_from(size_t column_index, uint32_t gap) const {
    return RowSliceNew(
        ptr + (column_index - gap) * stride,
        stride,
        optimized_offset + column_index - gap,
        dummy_offset + column_index
    );
}
Author:

This is another key update:

  1. In the case of APC, we advance by column_index - gap, accumulating optimized_offset by the post-optimization width (column_index - gap) and dummy_offset by the pre-optimization width (column_index).
  2. In the case of non-APC, we advance by column_index (because gap is 0). For the same reason, we advance both optimized_offset and dummy_offset by column_index.
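A host-side sketch of just the offset bookkeeping (OffsetPair is a hypothetical reduction of RowSliceNew that drops the data pointer and stride):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side sketch of the RowSliceNew offset bookkeeping only. slice_from
// advances optimized_offset by the post-optimization width (column_index - gap)
// and dummy_offset by the pre-optimization width (column_index); with gap == 0
// the two advance in lockstep.
struct OffsetPair {
    size_t optimized_offset;
    size_t dummy_offset;

    OffsetPair slice_from(size_t column_index, uint32_t gap) const {
        return {optimized_offset + column_index - gap, dummy_offset + column_index};
    }
};
```

The invariant dummy_offset - optimized_offset == total gaps so far holds across nested slices.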

Reviewer:

I wonder if we can make this function take the same arguments as originally (only the column_index). Could it figure out the gap from the substitutions if it also had that internally? That would minimize the changes needed in the original chips.

Comment on lines 156 to 163
/// Write a single value into `FIELD` of struct `STRUCT<T>` at a given row.
#define COL_WRITE_VALUE(ROW, STRUCT, FIELD, VALUE) (ROW).write(COL_INDEX(STRUCT, FIELD), VALUE)

/// Conditionally write a single value into `FIELD` based on APC sub-columns.
#define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \
do { \
const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \
const auto _apc_idx = (SUB)[_col_idx + ROW.dummy_offset]; \
Reviewer:

Same here: if the sub is stored in the RowSlice, we can keep the same interface for COL_WRITE_VALUE etc

Author:

Good point! :)

Will implement once the current version works end to end, so that other chips will have minimal changes.

Author:

This is partially implemented, but I need to clean up by removing SUB from some APIs.

);
}

__device__ void fill_trace_row_new(RowSliceNew row, Rv32BaseAluAdapterRecord record, uint32_t *sub) {
Reviewer:

then here we don't have to pass the subs everywhere and the whole thing would be unchanged?

Author:

Same as above.

qwang98 commented Nov 26, 2025

Despite what the latest commit message suggests, it's now confirmed that the memory-alignment error is non-deterministic across runs, which makes it even more perplexing, though it also suggests that the main trace is actually correct.

However, the "fix" at least works for some runs, which also means that it's not a "full fix" yet...

The error only happens during RangeTupleChecker trace gen (which is very weird as it doesn't even happen in ALU), and a few further debugging ideas include:

  1. Compare trace gen inputs when the error happens vs not across different runs.
    [Update - there are no differences]
  2. Compare generate_proving_ctx chip order when the error happens vs not across different runs.
    [Update - It's the same order]
  3. It seems that when I comment out ALU chip related skips for dummy trace gen, the code works properly, so there might be some "chip order matching" issue between VmChipComplex and the new order without ALU. This might also relate to the dummy VmChipComplex, though I still don't see an immediate reason this is related to CUDA memory address misalignment...
    [Update - I don't think it's related to the dummy chip complex, because it's only used for generating dummy traces, which are only generated if we have record arenas for them. The chip generation order is always the same for real or dummy chip complex, it's just that not all chips have dummy record arenas to generate trace for.]

Here's the very cryptic error:

thread 'tests::guest_prove_simple' (1400376) panicked at /home/steve/openvm/crates/circuits/primitives/src/range_tuple/cuda.rs:57:63:
called `Result::unwrap()` on an `Err` value: CudaError { code: 716, name: "cudaErrorMisalignedAddress", message: "misaligned address" }
stack backtrace:
   0: __rustc::rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx
   4: <alloc::sync::Arc<C> as openvm_stark_backend::chip::Chip<R,PB>>::generate_proving_ctx
   5: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
   7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   8: openvm_circuit::arch::extensions::VmChipComplex<SC,RA,PB,SCC>::generate_proving_ctx
   9: openvm_circuit::arch::vm::VirtualMachine<E,VB>::generate_proving_ctx
  10: powdr_openvm::prove
  11: powdr_openvm::tests::compile_and_prove
  12: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

thread 'tests::guest_prove_simple' (1400376) panicked at /home/steve/stark-backend/crates/cuda-common/src/d_buffer.rs:164:49:
GPU free failed: Cuda(CudaError { code: 716, name: "cudaErrorMisalignedAddress", message: "misaligned address" })
stack backtrace:
   0:     0x55c84fd117b2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h3a13e7dab5691c53
   1:     0x55c84fd22f1f - core::fmt::write::hd8c0f44b642e253d
   2:     0x55c84fcda521 - std::io::Write::write_fmt::hf03c34b98e3cb9e5
   3:     0x55c84fce7942 - std::sys::backtrace::BacktraceLock::print::hf7efbf4d3f6c0c71
   4:     0x55c84fcecd3f - std::panicking::default_hook::{{closure}}::h0ac687f7d570a3b5
   5:     0x55c84fcecb99 - std::panicking::default_hook::hb0e8ee7127da5893
   6:     0x55c84fced475 - std::panicking::panic_with_hook::h3136bc18e19ec6ee
   7:     0x55c84fced25a - std::panicking::panic_handler::{{closure}}::haa99ed2ac62a97d2
   8:     0x55c84fce7a89 - std::sys::backtrace::__rust_end_short_backtrace::he1a6c69637605395
   9:     0x55c84fccd8bd - __rustc[d556568c0434a7c8]::rust_begin_unwind
  10:     0x55c84fd2c2a0 - core::panicking::panic_fmt::hb9dc3f33c24f4370
  11:     0x55c84fd2b896 - core::result::unwrap_failed::h8505ad54330fe7b1
  12:     0x55c84fbd4b23 - <openvm_cuda_common::d_buffer::DeviceBuffer<T> as core::ops::drop::Drop>::drop::h0b4932f7a4d374ed
  13:     0x55c84fbca610 - alloc::sync::Arc<T,A>::drop_slow::h6e8cd5c626fa5b93
  14:     0x55c84f9c8ab6 - <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx::hec7a6821188692a6
  15:     0x55c84f9d2c80 - <alloc::sync::Arc<C> as openvm_stark_backend::chip::Chip<R,PB>>::generate_proving_ctx::hd77ae4278e3ab378
  16:     0x55c84efa7214 - <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold::h9a89414bd5845c29
  17:     0x55c84efd42c0 - <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold::hb24cf86bce89ce1b
  18:     0x55c84ef6d01e - <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter::h18814d42e0cedcca
  19:     0x55c84f309771 - openvm_circuit::arch::extensions::VmChipComplex<SC,RA,PB,SCC>::generate_proving_ctx::ha833d26f2b4d35c5
  20:     0x55c84f54214e - openvm_circuit::arch::vm::VirtualMachine<E,VB>::generate_proving_ctx::h9b43cb78150c2300
  21:     0x55c84f536898 - powdr_openvm::prove::h8999b02c646547b8
  22:     0x55c84f53b69b - powdr_openvm::tests::compile_and_prove::h0ef074af7e97449c
  23:     0x55c84f561b42 - core::ops::function::FnOnce::call_once::h35d130ccf1a67607
  24:     0x55c84f5ab01b - test::__rust_begin_short_backtrace::ha0c21b8c306b8253
  25:     0x55c84f5c0a05 - test::run_test::{{closure}}::h9c0d03c2998f1f3f
  26:     0x55c84f597424 - std::sys::backtrace::__rust_begin_short_backtrace::hfab53282aba2d292
  27:     0x55c84f59ae0a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hce3210c75fe85616
  28:     0x55c84fce24af - std::sys::thread::unix::Thread::new::thread_start::h6ea26e7622e6e954
  29:     0x7d9cd5c9caa4 - start_thread
                               at ./nptl/pthread_create.c:447:8
  30:     0x7d9cd5d29c6c - clone3
                               at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0
  31:                0x0 - <unknown>

thread 'tests::guest_prove_simple' (1400376) panicked at library/core/src/panicking.rs:236:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `-p powdr-openvm --lib`

Caused by:
  process didn't exit successfully: `/home/steve/powdr/target/release/deps/powdr_openvm-39c4e0f6b0554f48 'tests::guest_prove_simple' --exact --nocapture` (signal: 6, SIGABRT: process abort signal)

qwang98 commented Nov 27, 2025

See my answers to the debugging ideas I proposed. Unfortunately, they didn't help solve the bug.

A few more things I tried today:

  1. The error seems related to not generating dummy traces for some chips that have dummy record arenas. To reproduce the error with minimal changes, I started from the main branch and skipped the ALU chip in dummy trace generation and Subst creation (so that we don't map from the ALU dummy trace to the APC trace via Subst). In theory this should only panic at the prover, which it indeed does in some runs, because the APC trace is incorrect. However, in some other runs, RangeTupleChecker panics with the following error. Note that this happens in trace gen, before proving.

This "minimum change test vector" is here: powdr-labs/powdr#3458

thread 'tests::guest_prove_simple' (2504511) panicked at /home/steve/.cargo/git/checkouts/openvm-77dd23e285a1262c/60073d7/crates/circuits/primitives/src/range_tuple/cuda.rs:57:63:
called `Result::unwrap()` on an `Err` value: CudaError { code: 700, name: "cudaErrorIllegalAddress", message: "an illegal memory access was encountered" }
stack backtrace:
   0: __rustc::rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx
  2. Debug Keccak, which didn't go very far, as I keep getting a similar style of CUDA memory-access error; only this time it's DeviceBuffer::with_capacity failing to malloc on the device for the Vec<OriginalAir> in PowdrChipGPU::try_generate_witness, which leaves me even more confused...


Comment on lines +101 to +104
limbs.write_new(i, limb_u32);
if (!limbs.is_apc) {
add_count(limb_u32, min(bits_remaining, range_max_bits));
}
Author:

This is the only decompose_new difference from decompose, which has:

limbs[i] = limb_u32;
add_count(limb_u32, min(bits_remaining, range_max_bits));

const size_t lower_decomp_len,
RowSliceNew lower_decomp
) {
rc.decompose_new(y - x - 1, max_bits, lower_decomp, lower_decomp_len);
Author:

No difference from generate_subrow except that we use decompose_new.

}

__device__ void fill_new(RowSliceNew row, uint32_t prev_timestamp, uint32_t timestamp) {
AssertLessThan::generate_subrow_new(
Author:

The same as fill except that we use generate_subrow_new here.

Comment on lines +248 to +250
if (!rs2_aux.is_apc) {
bitwise_lookup.add_range(record.rs2 & mask, (record.rs2 >> RV32_CELL_BITS) & mask);
}
Author:

The only difference from adapter.fill_trace_row is here, where we filter side effects by is_apc.

All other changes simply use the _new versions of the APIs.

@qwang98 qwang98 marked this pull request as ready for review December 2, 2025 11:59