[DO NOT MERGE] minimum change reproducing the CUDA memory bug in direct to APC #3458
    // let mut output = DeviceMatrix::<BabyBear>::with_capacity(height, width);
    use openvm_stark_backend::p3_field::FieldAlgebra;
    let zeros = vec![BabyBear::ZERO; height * width];
    let device_buffer = zeros
        .to_device()
        .expect("copy zero trace to device failed");
    println!("output len: {}", device_buffer.len());
    let mut output =
        DeviceMatrix::<BabyBear>::new(Arc::new(device_buffer), height, width);
Zeroing out the output buffer so that we can be sure any illegal memory access isn't caused by the cells being empty (uninitialized).
    if air_name == "VmAirWrapper<Rv32BaseAluAdapterAir, BaseAluCoreAir<4, 8>" {
        return None;
    }
Skip dummy trace generation.
    if *air_name == "VmAirWrapper<Rv32BaseAluAdapterAir, BaseAluCoreAir<4, 8>" {
        return (airs, substitutions)
    }
Skip creating the Subst so that dummy traces aren't written to the APC trace, which leaves us with an incorrect APC trace. This should normally panic in the prover, but in some runs it instead panics with a CUDA memory access error at RangeTupleCheckerChipGPU, which is exactly the bug this PR tries to reproduce.
Close as fixed.
Reproduces the bug as described in this comment: powdr-labs/openvm#50 (comment)
Note that the CUDA memory error "very unrelatedly" happens on `RangeTupleChecker`. It is also non-deterministic and happens in some runs only (you might need to run `guest_prove_simple` a few times before it's encountered).
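
Because the failure only shows up in some runs, it can help to rerun the reproduction in a loop and stop at the first failing run. The following is a minimal sketch of that idea, not part of this PR: `first_failing_run` and `command_succeeds` are hypothetical helper names, and the program and arguments that actually launch `guest_prove_simple` are left to the caller since they are not stated here.

```rust
use std::process::Command;

/// Calls `attempt` up to `max_runs` times and returns the 1-based index of
/// the first run that reports a failure, or `None` if every run succeeded.
/// (Hypothetical helper, not part of the repository.)
fn first_failing_run<F>(max_runs: usize, mut attempt: F) -> Option<usize>
where
    F: FnMut(usize) -> bool, // must return `true` when the run succeeded
{
    (1..=max_runs).find(|&run| !attempt(run))
}

/// Runs an external program once and reports whether it exited successfully.
/// A program that cannot be started at all is also counted as a failed run.
/// The caller supplies whatever command launches the reproduction; no
/// specific command line is assumed here.
#[allow(dead_code)]
fn command_succeeds(program: &str, args: &[&str]) -> bool {
    Command::new(program)
        .args(args)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A call such as `first_failing_run(20, |_| command_succeeds(program, &args))` would then stop at the first run that fails, with `program` and `args` filled in by whoever runs the reproduction. Each failure should still be inspected by hand, since an incorrect APC trace is expected to panic in the prover in the non-CUDA case as well.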