-
Notifications
You must be signed in to change notification settings - Fork 0
[OVM GPU] direct to APC trace gen #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 1.4.1-upgrade
Are you sure you want to change the base?
Conversation
| #define COL_WRITE_VALUE_APC(APC_ROW, STRUCT, FIELD, VALUE, SUB, OFFSET) { | ||
| if SUB[COL_INDEX(STRUCT, FIELD) + OFFSET] != UINT32_MAX { | ||
| (APC_ROW).write( | ||
| COL_INDEX(STRUCT, FIELD) + OFFSET, | ||
| VALUE | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only write if not optimized away based on SUB.
| if !apc_row.is_null() { | ||
| COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset); | ||
| COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset); | ||
| COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, b_pointer, Fp::fromRaw(rec.b), sub, offset); | ||
| COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, c_pointer, Fp::fromRaw(rec.c), sub, offset); | ||
|
|
||
| // TODO: adapt the rest similar to above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not complete but just showing how we use COL_WRITE_VALUE_APC.
|
|
||
| impl Chip<DenseRecordArena, GpuBackend> for Rv32BaseAluChipGpu { | ||
| fn generate_proving_ctx(&self, arena: DenseRecordArena) -> AirProvingContext<GpuBackend> { | ||
| fn generate_proving_ctx(&self, arena: DenseRecordArena, d_apc_trace: DeviceMatrix<F>, subs: Option<Vec<Vec<u32>>>, apc_row_index: Option<Vec<u32>>) -> AirProvingContext<GpuBackend> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d_apc_trace should be supplied by PowdrTraceGenerator::generate_witness as Some variant and set to None for regular trace gen when no APC involved. Similar for subs and apc_row_index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subs will have the same length as # of dummy columns, where entry is indexed by dummy column index and value is APC index. Value is hardcoded to UINT32_MAX if the dummy column is optimized away in APC.
This vector design is needed because GPU has no native map encoding and because this vector should be not too sparse so we don't waste too much space.
| RowSlice apc_row(d_apc_trace + apc_row_index[idx], height); | ||
| auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we need the sub for a specific dummy row, because each row might be optimized differently within an APC.
The apc_row should also be specific to the dummy row, because dummy trace gen can involve more than one APC row.
| COL_WRITE_VALUE(row, AluNativeAdapterCols, reads_aux[i].is_immediate, Fp::one()); | ||
| if (i == 0) { | ||
| COL_WRITE_VALUE(row, AluNativeAdapterCols, e_as, Fp(AS_IMMEDIATE)); | ||
| __device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is stopping us from passing row and apc_row as the same argument? Then it removes the conditional in the body? Also COL_WRITE_VALUE_APC and COL_WRITE_VALUE can be unified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good point re unifying row and apc_row.
I think main point on the COL_WRITE_VALUE_APC is speed? Basically one will go through subs and the other won't, but this is also at the cost of a larger diff.
| COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data); | ||
| mem_helper.fill( | ||
| row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)), | ||
| rec.write_aux.prev_timestamp, | ||
| rec.from_timestamp + 2 | ||
| ); | ||
| } | ||
|
|
||
| COL_WRITE_ARRAY(row, AluNativeAdapterCols, write_aux.prev_data, rec.write_aux.prev_data); | ||
| mem_helper.fill( | ||
| row.slice_from(COL_INDEX(AluNativeAdapterCols, write_aux)), | ||
| rec.write_aux.prev_timestamp, | ||
| rec.from_timestamp + 2 | ||
| ); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs to go through the subs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly. Haven't modified this yet, but will be a similar style as COL_WRITE_VALUE_APC
| __device__ void fill_trace_row(RowSlice row, AluNativeAdapterRecord const &rec, RowSlice apc_row, uint32_t *sub, uint32_t offset) { | ||
| if !apc_row.is_null() { | ||
| COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, from_state.timestamp, Fp(rec.from_timestamp), sub, offset); | ||
| COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| COL_WRITE_VALUE_APC(row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset); | |
| COL_WRITE_VALUE_APC(apc_row, AluNativeAdapterCols, a_pointer, Fp::fromRaw(rec.a_ptr), sub, offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
etc
07b6a29 to
5b3078e
Compare
| __device__ __forceinline__ size_t number_of_gaps_in(const uint32_t *sub, size_t len) { | ||
| size_t gaps = 0; | ||
| #pragma unroll | ||
| for (size_t i = 0; i < len; ++i) { | ||
| if (sub[i] == UINT32_MAX) { | ||
| ++gaps; | ||
| } | ||
| } | ||
| return gaps; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should return 0 for a sub vector without UINT32_MAX.
| uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x; | ||
| RowSlice row(d_trace + idx, height); | ||
| if (idx < d_records.len()) { | ||
| auto const &rec = d_records[idx]; | ||
| // RowSlice apc_row(d_apc_trace + apc_row_index[idx], height); | ||
| // auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row | ||
| uint32_t *sub = subs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently subs is just an identity vector, but will involve a lot more precomputation if we want them to work with APC.
| core.fill_trace_row(row.slice_from(COL_INDEX(Rv32BaseAluCols, core)), rec.core); | ||
| core.fill_trace_row_new(row.slice_from(COL_INDEX(Rv32BaseAluCols, core) - number_of_gaps_in(sub, sizeof(Rv32BaseAluCols<uint8_t>))), rec.core, sub); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is a very common style patch. Basically whenever we slice an (APC) row, we now slice it but also "retract" it by the number of gaps in the optimization. For non-APC row, we should "retract" nothing, so should be "backward compatible" with non-APC calls of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes maybe there's a more natural way to encode it, but if this works we can improve later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this would be moved into the slice_from function though, and we need to store the offset in the rowslice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| let d_records = records.to_device().unwrap(); | ||
| let d_trace = DeviceMatrix::<F>::with_capacity(trace_height, trace_width); | ||
|
|
||
| // TODO: use actual sub not hardcoded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for software it would still use the identity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
…ith non-APC but not sure if APC works
| struct RowSliceNew { | ||
| Fp *ptr; | ||
| size_t stride; | ||
| size_t optimized_offset; | ||
| size_t dummy_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this new RowSliceNew struct that stores the offsets.
optimized_offset is the smaller one, basically the cumulative dummy_offset subtracted by gap.
dummy_offset is the larger one, basically the cumulative COL_INDEX of original columns.
| __device__ __forceinline__ void write_array_new(size_t column_index, size_t length, const T *values, const uint32_t *sub) | ||
| const { | ||
| #pragma unroll | ||
| for (size_t i = 0; i < length; i++) { | ||
| if (sub[i] != UINT32_MAX) { | ||
| ptr[(column_index + i) * stride] = values[i]; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to update this as well but currently don't see a bug.
| /// Conditionally write a single value into `FIELD` based on APC sub-columns. | ||
| #define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \ | ||
| do { \ | ||
| const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \ | ||
| const auto _apc_idx = (SUB)[_col_idx + ROW.dummy_offset]; \ | ||
| if (_apc_idx != UINT32_MAX) { \ | ||
| (ROW).write(_apc_idx - ROW.optimized_offset, VALUE); \ | ||
| } \ | ||
| } while (0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the other key update that computes relative sub from absolute subs:
- Convert relative
_col_idxto absolutesubindex by addingROW.dummy_offset, because the index is based on original dummy offset. _apc_idxis absolute APC index.- Because now we also row slice APC trace by post-optimization offset, we need relative APC index when writing, and that's why we subtract
ROW.optimized_offsetfrom absolute APC index to obtain the relative APC index.
This should be fully backward compatible, because for non-APC use case, optimized_offset is the same as dummy_offset (as there are no gaps), so adding and subtracting by the same thing is equivalent to using the identity matrix.
| RowSlice row(d_trace + idx, height); | ||
| RowSliceNew row(d_trace + idx / calls_per_apc_row, height, 0, 0); // we need to slice to the correct APC row, but if non-APC it's dividing by 1 and therefore the same idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize row with zero for both optimized_offset and dummy_offset.
In the case of non-APC row, calls_per_apc_row is hardcoded to 1, so it's basically the same as d_trace + idx.
In the case of APC row, we integer divide by calls_per_apc_row, so we access to the correct APC slice.
| auto const &rec = d_records[idx]; | ||
| // RowSlice apc_row(d_apc_trace + apc_row_index[idx], height); | ||
| // auto const sub = subs[idx * width]; // offset the subs to the corresponding dummy row | ||
| uint32_t *sub = &subs[(idx % calls_per_apc_row) * width]; // dummy width |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the correct sub for the instruction, which depends on which record it is in.
All sub should have the same width across instructions, which is convenient.
| __device__ __forceinline__ RowSliceNew slice_from(size_t column_index, uint32_t gap) const { | ||
| return RowSliceNew(ptr + (column_index - gap) * stride, stride, optimized_offset + column_index - gap, dummy_offset + column_index); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another key update:
- In case of APC, we advance by
column_index - gap, and accumulateoptimized_offsetby the post optimization width (column_index - gap) whereas we accumulatedummy_offsetby the pre optimization width (column_index). - In case of non-APC, we advance by
column_index(becausegapis 0). For the same reason, we advanceoptimized_offsetanddummy_offsetbycolumn_index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can make this function take the same arguments as originally (only the column_index). Could it figure out the gap from the substitutions if it also had that internally? That would minimize the changes needed in the original chips.
| /// Write a single value into `FIELD` of struct `STRUCT<T>` at a given row. | ||
| #define COL_WRITE_VALUE(ROW, STRUCT, FIELD, VALUE) (ROW).write(COL_INDEX(STRUCT, FIELD), VALUE) | ||
|
|
||
| /// Conditionally write a single value into `FIELD` based on APC sub-columns. | ||
| #define COL_WRITE_VALUE_NEW(ROW, STRUCT, FIELD, VALUE, SUB) \ | ||
| do { \ | ||
| const size_t _col_idx = COL_INDEX(STRUCT, FIELD); \ | ||
| const auto _apc_idx = (SUB)[_col_idx + ROW.dummy_offset]; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: if the sub is stored in the RowSlice, we can keep the same interface for COL_WRITE_VALUE etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! :)
Will implement once the current version works end to end, so that other chips will have minimal changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is partially implemented, but I need to clean up by removing SUB from some APIs.
| ); | ||
| } | ||
|
|
||
| __device__ void fill_trace_row_new(RowSliceNew row, Rv32BaseAluAdapterRecord record, uint32_t *sub) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then here we don't have to pass the subs everywhere and the whole thing would be unchanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
26f3760 to
87ff97c
Compare
|
Despite what the latest commit message suggests, it's now confirmed that the memory alignment error is non deterministic depending on the run, which makes it even more perplexing, though it also suggests that main trace is actually correct. However, the "fix" at least works for some runs, which also means that it's not a "full fix" yet... The error only happens during
Here's the very cryptic error: |
|
See my answers to the debugging ideas I proposed. Unfortuantely they didn't help solve the bug. A few more things I tried today:
This "minimum change test vector" is here: powdr-labs/powdr#3458
|
| limbs.write_new(i, limb_u32); | ||
| if (!limbs.is_apc) { | ||
| add_count(limb_u32, min(bits_remaining, range_max_bits)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only decompose_new difference from decompose, which has:
limbs[i] = limb_u32;
add_count(limb_u32, min(bits_remaining, range_max_bits));
| const size_t lower_decomp_len, | ||
| RowSliceNew lower_decomp | ||
| ) { | ||
| rc.decompose_new(y - x - 1, max_bits, lower_decomp, lower_decomp_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No difference from generate_subrow_new except that we use decompose_new.
| } | ||
|
|
||
| __device__ void fill_new(RowSliceNew row, uint32_t prev_timestamp, uint32_t timestamp) { | ||
| AssertLessThan::generate_subrow_new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as fill except that we use generate_subrow_new here.
| if (!rs2_aux.is_apc) { | ||
| bitwise_lookup.add_range(record.rs2 & mask, (record.rs2 >> RV32_CELL_BITS) & mask); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference from adapter.fill_trace_row is here, where we filter side effects by is_apc.
All other differences are using the new versions of APIs.
WIP. Currently just a prototype for ALU chip on the OVM side. If we were to go down this route, a similar modification for all chips is needed.
Passes
guest_prove_simpleNON-DETERMINISTICALLY in some runs, likely due to a bug that's more on the Powdr end than on OVM end: powdr-labs/powdr#3458. Keccak test entirely fails, likely due to the same bug.Current draft represents minimum change to the chip implementations, by pushing API changes as much as possible to
RowSliceNew. The only changes to chips are:RowSliceaccording to additional injected device inputs from CPU.add_count) whetherrow_slice.is_apcis true.Otherwise, most APIs are simply replaced with their
_newversion, without any changes to the arguments.TODOs:
There's nothing static added to this yet, as currently everything (including whether we take the APC witgen path vs original instruction path for ALU chip for current call to
genenerate_proving_ctx) is run time values.