@qwang98 qwang98 commented Nov 17, 2025

WIP. Currently just a prototype for the ALU chip on the OVM side. If we go down this route, a similar modification will be needed for all chips.

Passes guest_prove_simple NON-DETERMINISTICALLY in some runs, likely due to a bug that's more on the Powdr end than on the OVM end: powdr-labs/powdr#3458. The Keccak test fails entirely, likely due to the same bug.

The current draft represents the minimum change to the chip implementations, pushing API changes as much as possible into RowSliceNew. The only changes to the chips are:

  • Create the first APC RowSlice according to additional injected device inputs from CPU.
  • Conditionally skip modifications to the dummy periphery (e.g. add_count) depending on whether row_slice.is_apc is true.

Otherwise, most APIs are simply replaced with their _new version, without any changes to the arguments.

TODOs:

  • Pass Fibo test with identity matrix for subs for APC dummy trace gen and non-APC trace gen
  • Pass Fibo test with direct write to APC trace
  • Pass Keccak test
  • Benchmark ALU chip and decide whether to continue

There's nothing static added to this yet, as currently everything (including whether we take the APC witgen path vs. the original instruction path for the ALU chip in the current call to generate_proving_ctx) is determined by runtime values.

Comment on lines 75 to 82
#define COL_WRITE_VALUE_APC(APC_ROW, STRUCT, FIELD, VALUE, SUB, OFFSET) \
    do { \
        if ((SUB)[COL_INDEX(STRUCT, FIELD) + (OFFSET)] != UINT32_MAX) { \
            (APC_ROW).write( \
                COL_INDEX(STRUCT, FIELD) + (OFFSET), \
                VALUE \
            ); \
        } \
    } while (0)
Author:

Only write if not optimized away based on SUB.

Comment on lines 40 to 46
if (!apc_row.is_null()) {
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, b_pointer, Fp::fromRaw(rec.b), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, c_pointer, Fp::fromRaw(rec.c), sub, offset);

// TODO: adapt the rest similar to above
Author:

Not complete but just showing how we use COL_WRITE_VALUE_APC.


impl Chip<DenseRecordArena, GpuBackend> for Rv32BaseAluChipGpu {
fn generate_proving_ctx(&self, arena: DenseRecordArena) -> AirProvingContext<GpuBackend> {
fn generate_proving_ctx(&self, arena: DenseRecordArena, d_apc_trace: DeviceMatrix<F>, subs: Option<Vec<Vec<u32>>>, apc_row_index: Option<Vec<u32>>) -> AirProvingContext<GpuBackend> {
Author:

d_apc_trace should be supplied by PowdrTraceGenerator::generate_witness as the Some variant, and set to None for regular trace gen when no APC is involved. Similarly for subs and apc_row_index.

Author:

subs has the same length as the number of dummy columns: each entry is indexed by dummy column index and holds the corresponding APC column index. The value is hardcoded to UINT32_MAX if the dummy column is optimized away in the APC.

This vector design is needed because the GPU has no native map encoding; as long as the vector is not too sparse, we don't waste too much space.

Comment on lines 43 to 44
RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
Author:

Here we need the sub for a specific dummy row, because each row might be optimized differently within an APC.

The apc_row should also be specific to the dummy row, because dummy trace gen can involve more than one APC row.

COL_WRITE_VALUE(row, AluNativeAdapterCols, reads_aux[i].is_immediate, Fp::one());
if (i == 0) {
COL_WRITE_VALUE(row, AluNativeAdapterCols, e_as, Fp(AS_IMMEDIATE));
__device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) {
Reviewer:

What is stopping us from passing row and apc_row as the same argument? That would remove the conditional in the body. Could COL_WRITE_VALUE_APC and COL_WRITE_VALUE be unified as well?

Author:

Yeah that's a good point re unifying row and apc_row.

I think the main point of keeping COL_WRITE_VALUE_APC separate is speed: one goes through subs and the other doesn't, but that also comes at the cost of a larger diff.

Comment on lines 85 to 92
COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data);
mem_helper.fill(
row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)),
rec.write_aux.prev_timestamp,
rec.from_timestamp + 2
);
}

COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data);
mem_helper.fill(
row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)),
rec.write_aux.prev_timestamp,
rec.from_timestamp + 2
);

Reviewer:

This also needs to go through the subs

Author:

Yes, exactly. I haven't modified this yet, but it will follow a similar style to COL_WRITE_VALUE_APC.

__device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) {
if (!apc_row.is_null()) {
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset);
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
Reviewer:

Suggested change
COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);
COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset);

Reviewer:

etc

Comment on lines 113 to 122
__device__ __forceinline__ size_t number_of_gaps_in(const uint32_t *sub, size_t len) {
size_t gaps = 0;
#pragma unroll
for (size_t i = 0; i < len; ++i) {
if (sub[i] == UINT32_MAX) {
++gaps;
}
}
return gaps;
}
Author:

Should return 0 for a sub vector without UINT32_MAX.
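A host-side analogue (names hypothetical) that pins down this contract:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side analogue of number_of_gaps_in, for illustration only: counts the
// UINT32_MAX sentinels (optimized-away columns) in a sub slice. A sub slice
// with no sentinels has zero gaps.
size_t number_of_gaps(const uint32_t *sub, size_t len) {
    size_t gaps = 0;
    for (size_t i = 0; i < len; ++i) {
        if (sub[i] == UINT32_MAX) {
            ++gaps;
        }
    }
    return gaps;
}
```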

Comment on lines 35 to 45
uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
RowSlice row(d_trace + idx, height);
if (idx < d_records.len()) {
auto const &rec = d_records[idx];
// RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
// auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
uint32_t *sub = subs;
Author:

Currently subs is just an identity vector, but it will require a lot more precomputation to make it work with APCs.

Comment on lines 48 to 55
core.fill_trace_row(row.slice_from(COL_INDEX(Rv32BaseAluCols, core)), rec.core);
core.fill_trace_row_new(row.slice_from(COL_INDEX(Rv32BaseAluCols, core) - number_of_gaps_in(sub, sizeof(Rv32BaseAluCols<uint8_t>))), rec.core, sub);
Author:

Note that this is a very common style of patch: whenever we slice an (APC) row, we now also "retract" the slice by the number of gaps introduced by the optimization. For a non-APC row we retract by nothing, so the call remains backward compatible with non-APC uses of the function.
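The retraction arithmetic can be sketched as follows (retracted_offset is a hypothetical name; the real code inlines this at each slice site):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the "retract by gaps" pattern: the start of a slice into the
// packed (post-optimization) APC row is the original column index minus the
// number of optimized-away columns that precede it in the sub slice.
size_t retracted_offset(size_t col_index, const uint32_t *sub) {
    size_t gaps = 0;
    for (size_t i = 0; i < col_index; ++i) {
        if (sub[i] == UINT32_MAX) {
            ++gaps; // this column was removed by the APC optimization
        }
    }
    return col_index - gaps;
}
```

For a non-APC row the sub slice has no sentinels, so the offset is returned unchanged and the call site behaves exactly as before.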

Reviewer:

Yes maybe there's a more natural way to encode it, but if this works we can improve later

Reviewer:

Ideally this would be moved into the slice_from function though, and we need to store the offset in the RowSlice.

Author:

Done.

@qwang98 qwang98 requested a review from Schaeff November 17, 2025 21:25
@qwang98 qwang98 changed the title [OVM GPU] Skip dummy witgen for APC [OVM GPU] direct to APC trace gen Nov 17, 2025
let d_records = records.to_device().unwrap();
let d_trace = DeviceMatrix::<F>::with_capacity(trace_height, trace_width);

// TODO: use actual sub not hardcoded
Reviewer:

I guess for software it would still use the identity

Author:

Yes.

Comment on lines +9 to +13
struct RowSliceNew {
Fp *ptr;
size_t stride;
size_t optimized_offset;
size_t dummy_offset;
Author (Nov 18, 2025):

Added this new RowSliceNew struct that stores the offsets.

optimized_offset is the smaller one: basically the cumulative dummy_offset minus the gaps.

dummy_offset is the larger one: basically the cumulative COL_INDEX of the original columns.

Comment on lines 46 to 54
__device__ __forceinline__ void write_array_new(size_t column_index, size_t length, const T *values, const uint32_t *sub)
const {
#pragma unroll
for (size_t i = 0; i < length; i++) {
if (sub[i] != UINT32_MAX) {
ptr[(column_index + i) * stride] = values[i];
}
}
}
Author:

Might need to update this as well, but I currently don't see a bug.

Comment on lines 159 to 167
/// Conditionally write a single value into `FIELD` based on APC sub-columns.
#define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \
do { \
const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \
const auto _apc_idx = (SUB)[_col_idx + (ROW).dummy_offset]; \
if (_apc_idx != UINT32_MAX) { \
(ROW).write(_apc_idx - (ROW).optimized_offset, VALUE); \
} \
} while (0)
Author:

This is the other key update that computes relative sub from absolute subs:

  1. Convert relative _col_idx to absolute sub index by adding ROW.dummy_offset, because the index is based on original dummy offset.
  2. _apc_idx is absolute APC index.
  3. Because now we also row slice APC trace by post-optimization offset, we need relative APC index when writing, and that's why we subtract ROW.optimized_offset from absolute APC index to obtain the relative APC index.

This should be fully backward compatible: in the non-APC use case, optimized_offset equals dummy_offset (there are no gaps), so adding and then subtracting the same offset is a no-op, equivalent to using the identity matrix.
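The same arithmetic, reduced to a host-side sketch (apc_write_index and its out-parameter shape are hypothetical; the real macro writes into the row directly):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the COL_WRITE_VALUE_NEW index arithmetic: convert a struct-relative
// column index into a write position relative to the current APC row slice.
// Returns false when the column was optimized away (UINT32_MAX sentinel).
bool apc_write_index(size_t col_idx, size_t dummy_offset, size_t optimized_offset,
                     const uint32_t *sub, size_t &out) {
    uint32_t apc_idx = sub[col_idx + dummy_offset]; // absolute APC column index
    if (apc_idx == UINT32_MAX) {
        return false; // optimized away: skip the write
    }
    out = apc_idx - optimized_offset; // make it relative to the current slice
    return true;
}
```

With an identity sub and equal offsets, the computed position equals col_idx, which is the claimed backward compatibility.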

Comment on lines 36 to 41
RowSlice row(d_trace + idx, height);
RowSliceNew row(d_trace + idx / calls_per_apc_row, height, 0, 0); // we need to slice to the correct APC row, but if non-APC it's dividing by 1 and therefore the same idx
Author:

Initialize row with zero for both optimized_offset and dummy_offset.

In the case of non-APC row, calls_per_apc_row is hardcoded to 1, so it's basically the same as d_trace + idx.

In the case of an APC row, we integer-divide by calls_per_apc_row, so we access the correct APC slice.

auto const &rec = d_records[idx];
// RowSlice apc_row(d_apc_trace + apc_row_index[idx], height);
// auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row
uint32_t *sub = &subs[(idx % calls_per_apc_row) * width]; // dummy width
Author:

We need the correct sub for the instruction, which depends on which record it is in.

All subs have the same width across instructions, which is convenient.
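A sketch of the combined record-to-row mapping from these two snippets (RecordMapping and map_record are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: each APC row absorbs calls_per_apc_row dummy records, so the APC row
// for record idx is idx / calls_per_apc_row, and the per-instruction sub slice
// starts at (idx % calls_per_apc_row) * width within the flat subs buffer.
struct RecordMapping {
    size_t apc_row;
    size_t sub_offset;
};

RecordMapping map_record(size_t idx, size_t calls_per_apc_row, size_t width) {
    return {idx / calls_per_apc_row, (idx % calls_per_apc_row) * width};
}
```

With calls_per_apc_row == 1 (the non-APC case), this degenerates to apc_row == idx and sub_offset == 0, matching the hardcoded identity path.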

Comment on lines 71 to 73
__device__ __forceinline__ RowSliceNew slice_from(size_t column_index, uint32_t gap) const {
    return RowSliceNew(
        ptr + (column_index - gap) * stride,
        stride,
        optimized_offset + column_index - gap,
        dummy_offset + column_index
    );
}
Author:

This is another key update:

  1. In the case of APC, we advance by column_index - gap, accumulating optimized_offset by the post-optimization width (column_index - gap) and dummy_offset by the pre-optimization width (column_index).
  2. In the case of non-APC, we advance by column_index (because gap is 0). For the same reason, we advance both optimized_offset and dummy_offset by column_index.
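A host-side sketch of just the offset bookkeeping (OffsetPair is a hypothetical reduction of RowSliceNew that drops the data pointer and stride):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side sketch of the RowSliceNew offset bookkeeping only. slice_from
// advances optimized_offset by the post-optimization width (column_index - gap)
// and dummy_offset by the pre-optimization width (column_index); with gap == 0
// the two advance in lockstep.
struct OffsetPair {
    size_t optimized_offset;
    size_t dummy_offset;

    OffsetPair slice_from(size_t column_index, uint32_t gap) const {
        return {optimized_offset + column_index - gap, dummy_offset + column_index};
    }
};
```

The invariant dummy_offset - optimized_offset == total gaps so far holds across nested slices.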

Reviewer:

I wonder if we can make this function take the same arguments as originally (only the column_index). Could it figure out the gap from the substitutions if it also had that internally? That would minimize the changes needed in the original chips.

Comment on lines 156 to 163
/// Write a single value into `FIELD` of struct `STRUCT<T>` at a given row.
#define COL_WRITE_VALUE(ROW, STRUCT, FIELD, VALUE) (ROW).write(COL_INDEX(STRUCT, FIELD), VALUE)

/// Conditionally write a single value into `FIELD` based on APC sub-columns.
#define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \
do { \
const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \
const auto _apc_idx = (SUB)[_col_idx + ROW.dummy_offset]; \
Reviewer:

Same here: if the sub is stored in the RowSlice, we can keep the same interface for COL_WRITE_VALUE etc

Author:

Good point! :)

Will implement once the current version works end to end, so that other chips will have minimal changes.

Author:

This is partially implemented, but I need to clean up by removing SUB from some APIs.

);
}

__device__ void fill_trace_row_new(RowSliceNew row, Rv32BaseAluAdapterRecord record, uint32_t *sub) {
Reviewer:

then here we don't have to pass the subs everywhere and the whole thing would be unchanged?

Author:

Same as above.

qwang98 commented Nov 26, 2025

Despite what the latest commit message suggests, it's now confirmed that the memory-alignment error is non-deterministic across runs, which makes it even more perplexing, though it also suggests that the main trace is actually correct.

However, the "fix" at least works for some runs, which also means that it's not a "full fix" yet...

The error only happens during RangeTupleChecker trace gen (which is very weird as it doesn't even happen in ALU), and a few further debugging ideas include:

  1. Compare trace gen inputs when the error happens vs not across different runs.
    [Update - there are no differences]
  2. Compare generate_proving_ctx chip order when the error happens vs not across different runs.
    [Update - It's the same order]
  3. It seems that when I comment out ALU chip related skips for dummy trace gen, the code works properly, so there might be some "chip order matching" issue between VmChipComplex and the new order without ALU. This might also relate to the dummy VmChipComplex, though I still don't see an immediate reason this is related to CUDA memory address misalignment...
    [Update - I don't think it's related to the dummy chip complex, because it's only used for generating dummy traces, which are only generated if we have record arenas for them. The chip generation order is always the same for real or dummy chip complex, it's just that not all chips have dummy record arenas to generate trace for.]

Here's the very cryptic error:

thread 'tests::guest_prove_simple' (1400376) panicked at /home/steve/openvm/crates/circuits/primitives/src/range_tuple/cuda.rs:57:63:
called `Result::unwrap()` on an `Err` value: CudaError { code: 716, name: "cudaErrorMisalignedAddress", message: "misaligned address" }
stack backtrace:
   0: __rustc::rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx
   4: <alloc::sync::Arc<C> as openvm_stark_backend::chip::Chip<R,PB>>::generate_proving_ctx
   5: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold
   7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   8: openvm_circuit::arch::extensions::VmChipComplex<SC,RA,PB,SCC>::generate_proving_ctx
   9: openvm_circuit::arch::vm::VirtualMachine<E,VB>::generate_proving_ctx
  10: powdr_openvm::prove
  11: powdr_openvm::tests::compile_and_prove
  12: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

thread 'tests::guest_prove_simple' (1400376) panicked at /home/steve/stark-backend/crates/cuda-common/src/d_buffer.rs:164:49:
GPU free failed: Cuda(CudaError { code: 716, name: "cudaErrorMisalignedAddress", message: "misaligned address" })
stack backtrace:
   0:     0x55c84fd117b2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h3a13e7dab5691c53
   1:     0x55c84fd22f1f - core::fmt::write::hd8c0f44b642e253d
   2:     0x55c84fcda521 - std::io::Write::write_fmt::hf03c34b98e3cb9e5
   3:     0x55c84fce7942 - std::sys::backtrace::BacktraceLock::print::hf7efbf4d3f6c0c71
   4:     0x55c84fcecd3f - std::panicking::default_hook::{{closure}}::h0ac687f7d570a3b5
   5:     0x55c84fcecb99 - std::panicking::default_hook::hb0e8ee7127da5893
   6:     0x55c84fced475 - std::panicking::panic_with_hook::h3136bc18e19ec6ee
   7:     0x55c84fced25a - std::panicking::panic_handler::{{closure}}::haa99ed2ac62a97d2
   8:     0x55c84fce7a89 - std::sys::backtrace::__rust_end_short_backtrace::he1a6c69637605395
   9:     0x55c84fccd8bd - __rustc[d556568c0434a7c8]::rust_begin_unwind
  10:     0x55c84fd2c2a0 - core::panicking::panic_fmt::hb9dc3f33c24f4370
  11:     0x55c84fd2b896 - core::result::unwrap_failed::h8505ad54330fe7b1
  12:     0x55c84fbd4b23 - <openvm_cuda_common::d_buffer::DeviceBuffer<T> as core::ops::drop::Drop>::drop::h0b4932f7a4d374ed
  13:     0x55c84fbca610 - alloc::sync::Arc<T,A>::drop_slow::h6e8cd5c626fa5b93
  14:     0x55c84f9c8ab6 - <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx::hec7a6821188692a6
  15:     0x55c84f9d2c80 - <alloc::sync::Arc<C> as openvm_stark_backend::chip::Chip<R,PB>>::generate_proving_ctx::hd77ae4278e3ab378
  16:     0x55c84efa7214 - <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold::h9a89414bd5845c29
  17:     0x55c84efd42c0 - <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::try_fold::hb24cf86bce89ce1b
  18:     0x55c84ef6d01e - <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter::h18814d42e0cedcca
  19:     0x55c84f309771 - openvm_circuit::arch::extensions::VmChipComplex<SC,RA,PB,SCC>::generate_proving_ctx::ha833d26f2b4d35c5
  20:     0x55c84f54214e - openvm_circuit::arch::vm::VirtualMachine<E,VB>::generate_proving_ctx::h9b43cb78150c2300
  21:     0x55c84f536898 - powdr_openvm::prove::h8999b02c646547b8
  22:     0x55c84f53b69b - powdr_openvm::tests::compile_and_prove::h0ef074af7e97449c
  23:     0x55c84f561b42 - core::ops::function::FnOnce::call_once::h35d130ccf1a67607
  24:     0x55c84f5ab01b - test::__rust_begin_short_backtrace::ha0c21b8c306b8253
  25:     0x55c84f5c0a05 - test::run_test::{{closure}}::h9c0d03c2998f1f3f
  26:     0x55c84f597424 - std::sys::backtrace::__rust_begin_short_backtrace::hfab53282aba2d292
  27:     0x55c84f59ae0a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hce3210c75fe85616
  28:     0x55c84fce24af - std::sys::thread::unix::Thread::new::thread_start::h6ea26e7622e6e954
  29:     0x7d9cd5c9caa4 - start_thread
                               at ./nptl/pthread_create.c:447:8
  30:     0x7d9cd5d29c6c - clone3
                               at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0
  31:                0x0 - <unknown>

thread 'tests::guest_prove_simple' (1400376) panicked at library/core/src/panicking.rs:236:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `-p powdr-openvm --lib`

Caused by:
  process didn't exit successfully: `/home/steve/powdr/target/release/deps/powdr_openvm-39c4e0f6b0554f48 'tests::guest_prove_simple' --exact --nocapture` (signal: 6, SIGABRT: process abort signal)

qwang98 commented Nov 27, 2025

See my answers to the debugging ideas I proposed. Unfortunately, they didn't help solve the bug.

A few more things I tried today:

  1. The error seems related to not generating dummy traces for some chips that have dummy record arenas. To reproduce the error with minimal changes, I started from the main branch and skipped the ALU chip in dummy trace generation and Subst creation (so that we don't map from the ALU dummy trace to the APC trace via Subst). In theory this should only panic at the prover, which it indeed does in some runs, because the APC trace is incorrect. However, in some other runs, RangeTupleChecker panics with the following error. Note that this happens in trace gen, before proving.

This "minimum change test vector" is here: powdr-labs/powdr#3458

thread 'tests::guest_prove_simple' (2504511) panicked at /home/steve/.cargo/git/checkouts/openvm-77dd23e285a1262c/60073d7/crates/circuits/primitives/src/range_tuple/cuda.rs:57:63:
called `Result::unwrap()` on an `Err` value: CudaError { code: 700, name: "cudaErrorIllegalAddress", message: "an illegal memory access was encountered" }
stack backtrace:
   0: __rustc::rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <openvm_circuit_primitives::range_tuple::cuda::RangeTupleCheckerChipGPU<_> as openvm_stark_backend::chip::Chip<RA,openvm_cuda_backend::prover_backend::GpuBackend>>::generate_proving_ctx
  2. Debug Keccak, which didn't go very far, as I keep getting a similar style of CUDA memory-access error; only this time it's DeviceBuffer::with_capacity failing to malloc on the device for the Vec<OriginalAir> in PowdrChipGPU::try_generate_witness, which leaves me even more confused...


Comment on lines +101 to +104
limbs.write_new(i, limb_u32);
if (!limbs.is_apc) {
add_count(limb_u32, min(bits_remaining, range_max_bits));
}
Author:

This is the only decompose_new difference from decompose, which has:

limbs[i] = limb_u32;
add_count(limb_u32, min(bits_remaining, range_max_bits));

const size_t lower_decomp_len,
RowSliceNew lower_decomp
) {
rc.decompose_new(y - x - 1, max_bits, lower_decomp, lower_decomp_len);
Author:

No difference from generate_subrow except that we use decompose_new.

}

__device__ void fill_new(RowSliceNew row, uint32_t prev_timestamp, uint32_t timestamp) {
AssertLessThan::generate_subrow_new(
Author:

The same as fill except that we use generate_subrow_new here.

Comment on lines +248 to +250
if (!rs2_aux.is_apc) {
bitwise_lookup.add_range(record.rs2 & mask, (record.rs2 >> RV32_CELL_BITS) & mask);
}
Author:

The only difference from adapter.fill_trace_row is here, where we filter side effects by is_apc.

All other changes simply use the _new versions of the APIs.

@qwang98 qwang98 marked this pull request as ready for review December 2, 2025 11:59