[feat] integrate smtforest, avoid ser/de of full account/vault data in database #1394
base: next
Conversation
```rust
/// Updates storage map SMTs in the forest for changed accounts
#[allow(clippy::too_many_lines)]
async fn update_storage_maps_in_forest(
```
Reviewing this one would be very helpful
It looks okay with a light pass; I'll try more tomorrow.
Mirko-von-Leipzig left a comment:
Mostly questions; no real issues found as yet
```sql
account_commitment BLOB NOT NULL,
code_commitment BLOB,
storage BLOB,
vault BLOB,
```
Question: what is the benefit of storing vault and storage headers in separate tables? I was thinking this would work as follows:

- `vault` becomes just `vault_root`. This should be strictly better than introducing the `account_vault_headers` table.
- `storage` becomes `storage_header` - a blob with the serialized storage header. Depending on how we use this data, this may or may not be a good idea. Or we may also need to add something like `storage_commitment`. (A schema sketch of this consolidated layout follows below.)
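A minimal sketch of that consolidated layout, written as a diesel `table!` definition in the style this store uses; everything beyond the `vault_root`/`storage_header`/`storage_commitment` columns proposed here is an assumption:

```rust
// Hypothetical consolidated schema -- a sketch of the proposal above,
// not the actual migration. Column names other than those discussed
// in this thread are assumptions.
diesel::table! {
    accounts (account_id, block_num) {
        account_id -> Binary,
        block_num -> BigInt,
        account_commitment -> Binary,
        code_commitment -> Nullable<Binary>,
        // replaces the separate `account_vault_headers` table
        vault_root -> Binary,
        // serialized `AccountStorageHeader`; replaces `account_storage_headers`
        storage_header -> Nullable<Binary>,
        // optional, so the commitment can be read without deserializing the blob
        storage_commitment -> Nullable<Binary>,
    }
}
```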
I don't think using the unversioned, implementation-defined serialization format and blob fields makes sense; it asks for future trouble. Given the number of fields, it seemed logical to me to move to a separate table. `vault` followed in symmetry, but given it's only a single value I can also just inline it into the `accounts` table again.

My bigger concern was how to match the tables: is `(account_id, block_num)` a good primary key? Good enough, yes; good in absolute terms, maybe. So I think adding a `storage_commitment` column to the `accounts` table might make sense, similar to `account_code`.
> I don't think using the unversioned, implementation-defined serialization format and blob fields makes sense; it asks for future trouble. Given the number of fields, it seemed logical to me to move to a separate table. `vault` followed in symmetry, but given it's only a single value I can also just inline it into the `accounts` table again.
This makes sense, but I do think splitting this into multiple tables comes with quite a bit of extra complexity (querying, insertion, cleanup) and performance overhead (the datasets will be pretty large, and I'm assuming joining tables would not be negligible).

AFAICT, putting `vault_root` into `accounts` comes without any downsides - so, I think we should do it.

For the storage header, I agree that using the serialization format defined in miden-base introduces some risk, but I think a simpler way to handle it is to write custom serialization/deserialization for `AccountStorageHeader` here. This would give us the same amount of control, but would reduce complexity and improve performance.

I don't think we need to write this custom serialization in this PR. I'd do the following:

- Replace the table with a `storage_header` field that would be a simple serialization of `AccountStorageHeader`.
- Create a follow-up issue to write the custom serialization/deserialization logic for `AccountStorageHeader` (a rough sketch of such an encoding follows below).
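As a loose illustration of that follow-up item, a versioned, store-owned encoding could look something like this; the `(slot_type, value)` layout is an assumption about `AccountStorageHeader`'s contents, not its actual API:

```rust
// Sketch of a store-owned, versioned encoding for the storage header,
// independent of miden-base's `Serializable` impl. The slot layout
// (one type byte + one 32-byte value per slot) is an assumption.
const STORAGE_HEADER_FORMAT_V1: u8 = 1;

fn encode_storage_header(slots: &[(u8, [u8; 32])]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + 4 + slots.len() * 33);
    out.push(STORAGE_HEADER_FORMAT_V1);
    out.extend_from_slice(&(slots.len() as u32).to_le_bytes());
    for (slot_type, value) in slots {
        out.push(*slot_type);
        out.extend_from_slice(value);
    }
    out
}

fn decode_storage_header(bytes: &[u8]) -> Option<Vec<(u8, [u8; 32])>> {
    let (&version, rest) = bytes.split_first()?;
    if version != STORAGE_HEADER_FORMAT_V1 {
        return None;
    }
    if rest.len() < 4 {
        return None;
    }
    let (len_bytes, mut rest) = rest.split_at(4);
    let len = u32::from_le_bytes(len_bytes.try_into().ok()?) as usize;
    // Guard against bogus length prefixes before allocating.
    if rest.len() != len.checked_mul(33)? {
        return None;
    }
    let mut slots = Vec::with_capacity(len);
    for _ in 0..len {
        let (entry, tail) = rest.split_at(33);
        slots.push((entry[0], entry[1..].try_into().ok()?));
        rest = tail;
    }
    Some(slots)
}
```

Versioning the format up front is the point: the blob can then evolve independently of whatever miden-base's derive-based serialization does.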
```rust
/// Represents a list of assets, if the number of assets is reasonably small, which
/// is currently set to 1000 for no particular reason.
```
Currently, every asset is 32 bytes - though we are likely to increase this to 64 bytes in the future. So, 1K assets would be about 64KB of data. That feels a bit high, so maybe we should reduce the limit, but I don't have a good rationale for what it should be exactly.
```rust
let too_many_assets = entries.len() > Self::MAX_RETURN_ENTRIES;

if too_many_assets {
    return Ok(Self::too_many());
}
```
Not for this PR, but we should probably have a way not to read all assets from the database in case the number of assets is too big. So, as a future optimization, we could also store `num_assets` in the `accounts` table (next to `vault_root`) - though it will require updating this field as the number of assets changes, which will not be trivial.
It might be easier to add a subquery with a `LIMIT (n+1)` clause, where `n` is our actual desired limit, to mitigate the largish memory overhead.
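A rough sketch of that shape in the diesel style used in this PR (fetching one row past the limit so overflow is detectable; table and column names mirror the snippet above and are illustrative):

```rust
// Fetch at most MAX_RETURN_ENTRIES + 1 rows so we can detect "too many"
// without materializing the whole vault in memory.
let entries: Vec<(Vec<u8>, Vec<u8>)> = t::table
    .filter(t::account_id.eq(&account_id_bytes))
    .select((t::vault_key, t::asset))
    .limit((Self::MAX_RETURN_ENTRIES + 1) as i64)
    .load(conn)?;

if entries.len() > Self::MAX_RETURN_ENTRIES {
    return Ok(Self::too_many());
}
```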
crates/proto/src/domain/account.rs (comment on an outdated diff)
```rust
/// ## Future Enhancement (TODO)
///
/// Currently, when `too_many_entries = true`, we return an empty list. A future improvement
/// would return a **partial SMT** with:
/// - A subset of entries (e.g., most frequently accessed)
/// - Merkle proofs for those entries
/// - Inner node commitments
///
/// This would allow clients to verify partial data cryptographically while still
/// signaling that more data exists. The reason this matters: if all leaf values are
/// included, one can reconstruct the entire SMT; if even one is missing, one cannot.
/// By providing proofs, we enable verification of partial data.
```
I'm not sure I'd put this into comments - but we should create an issue for this (unless we already have one). Overall, I don't think we'll proactively try to return part of the map; rather, we'd return a partial map when the user requests values for specific keys.
```rust
let too_many_entries = map_entries.len() > Self::MAX_RETURN_ENTRIES;
let map_entries = if too_many_entries { Vec::new() } else { map_entries };
```
Similar to one of the above comments: we should probably optimize this in the future to avoid reading full maps from the database when there are too many entries - but that would be even more tricky than doing this for the asset vault.
```rust
/// This reconstructs the `AccountHeader` by joining multiple tables:
/// - `accounts` table for `account_id`, `nonce`, `code_commitment`
/// - `account_vault_headers` table for `vault_root`
/// - `account_storage_headers` table for storage slot commitments (to compute `storage_commitment`)
```
Related to some of the above comments: ideally, we'd be able to read the full account header just from the `accounts` table. This should simplify queries like this - and I'm not seeing downsides to doing this.
```rust
/// Queries the account code for a specific account at a specific block number.
///
/// Returns `None` if:
/// - The account doesn't exist at that block
/// - The account has no code (private account or account without code commitment)
///
/// # Arguments
///
/// * `conn` - Database connection
/// * `account_id` - The account ID to query
/// * `block_num` - The block number at which to query the account code
///
/// # Returns
///
/// * `Ok(Some(Vec<u8>))` - The account code bytes if found
/// * `Ok(None)` - If account doesn't exist or has no code
/// * `Err(DatabaseError)` - If there's a database error
pub(crate) fn select_account_code_at_block(
```
Would be awesome to add SQL statements describing these queries (as we do for some similar methods).
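For example, something along these lines; the SQL is a hand-written guess at what the diesel query compiles to, with assumed table names:

```rust
/// Queries the account code for a specific account at a specific block number.
///
/// Illustrative SQL (the real statement is generated by diesel; table and
/// column names here are assumptions):
///
///   SELECT code
///   FROM account_codes
///   JOIN accounts ON accounts.code_commitment = account_codes.code_commitment
///   WHERE accounts.account_id = ?1 AND accounts.block_num <= ?2
///   ORDER BY accounts.block_num DESC
///   LIMIT 1
pub(crate) fn select_account_code_at_block(/* ... */)
```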
```rust
    Ok(result)
}

/// Queries the account header for a specific account at a specific block number.
```
I'm not sure this description is correct. I think what we are getting here is the most recent account header with a block number smaller than or equal to the `block_num` parameter - right?
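If that is indeed the behavior, a doc wording along these lines might be clearer (suggested phrasing only):

```rust
/// Queries the latest account header recorded at or before `block_num`,
/// i.e. the state of the account as of that block, not only headers
/// written exactly at `block_num`.
```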
```rust
fn compute_storage_commitment(slot_commitments: &[Word]) -> Word {
    use miden_objects::crypto::hash::rpo::Rpo256;

    let elements: Vec<Felt> = slot_commitments.iter().flat_map(|w| w.iter()).copied().collect();

    Rpo256::hash_elements(&elements)
}
```
We should try to figure out how to use `AccountStorageHeader` here - otherwise this will be quite brittle. For example, we could store the result of `AccountStorageHeader::to_vec()` in `accounts.storage_header`, and then we could deserialize the bytes into `AccountStorageHeader` and get the commitment from it directly.
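A sketch of that round-trip; the concrete miden-base method names (`read_from_bytes` from `Deserializable`, and a commitment accessor on `AccountStorageHeader`) are assumptions here, not verified API:

```rust
// Read side: recover the commitment from the stored blob instead of
// re-hashing slot commitments by hand. Method names are assumed.
fn storage_commitment_from_blob(blob: &[u8]) -> Result<Word, DatabaseError> {
    let header = AccountStorageHeader::read_from_bytes(blob)?;
    Ok(header.compute_commitment())
}
```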
```rust
let raw: Vec<(Vec<u8>, Option<Vec<u8>>)> = SelectDsl::select(
    t::table
        .filter(t::account_id.eq(&account_id_bytes))
        .filter(t::block_num.le(block_num_sql))
        .order(t::block_num.desc())
        .limit(1),
    (t::vault_key, t::asset),
)
.load(conn)?;
```
Would this not always return just one entry (because of `LIMIT 1`)? I think we may need to get all entries with `block_num` smaller than or equal to the passed-in block number and then perform the filtering manually - but maybe there is a better way to do it.
```rust
// Query latest storage headers for this account
let headers: Vec<AccountStorageHeaderRaw> =
    SelectDsl::select(t::table, AccountStorageHeaderRaw::as_select())
        .filter(t::account_id.eq(&account_id_bytes).and(t::is_latest.eq(true)))
        .order(t::slot_index.asc())
        .load(conn)?;
```
As mentioned in the previous comments, I would probably read the storage header directly from the accounts table.
```rust
// Query all entries for this slot at or before the given block
let raw: Vec<(Vec<u8>, Vec<u8>)> = SelectDsl::select(t::table, (t::key, t::value))
    .filter(
        t::account_id
            .eq(&account_id_bytes)
            .and(t::slot.eq(slot_sql))
            .and(t::block_num.le(block_num_sql)),
    )
    .load(conn)?;

// Parse entries
let entries: Vec<(Word, Word)> = raw
    .into_iter()
    .map(|(k, v)| Ok((Word::read_from_bytes(&k)?, Word::read_from_bytes(&v)?)))
    .collect::<Result<Vec<_>, DatabaseError>>()?;
```
Wouldn't this result in potentially multiple entries being read from the database? This by itself is fine (or at least I don't see a way to avoid it), but I think we do need to filter the entries here to make sure we don't have more than one entry per block (i.e., we should select the last entry for each key).
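A sketch of that filtering step, assuming the query above is extended to also select `block_num` and that `Word` implements `Hash` (otherwise key the map on the raw byte encoding):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Keep, for each key, only the value from the highest block number at or
// before the target block. Tuple layout `(key, value, block_num)` assumes
// the query is extended to select `block_num` as well.
let mut latest: HashMap<Word, (u32, Word)> = HashMap::new();
for (key, value, block_num) in entries {
    match latest.entry(key) {
        Entry::Occupied(mut e) => {
            if e.get().0 < block_num {
                e.insert((block_num, value));
            }
        }
        Entry::Vacant(e) => {
            e.insert((block_num, value));
        }
    }
}
let entries: Vec<(Word, Word)> =
    latest.into_iter().map(|(k, (_, v))| (k, v)).collect();
```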
bobbinth left a comment:
Looks good! Thank you! I left some comments above, but the main things are:

- It is not clear to me if there is much value in having the `account_storage_headers` and `account_vault_headers` tables. I think we can store the headers directly in the `accounts` table, which should simplify the code and make querying more efficient.
- I think some of the queries may not be working correctly - specifically the ones that try to reconstruct the account vault or storage map at a given block. The main reason is that I think we need to read more data from the DB and then filter it in memory to remove potential duplicates. But it is also possible that I missed something.
```rust
async fn update_vaults_in_forest(
    &self,
    changed_account_ids: &[AccountId],
    block_num: BlockNumber,
) -> Result<(), ApplyBlockError> {
```
Would it be possible to move the `InnerForest` struct into a separate file and then also attach these methods to that struct? For example, something like:

```rust
impl InnerForest {
    pub async fn update_vaults(
        &self,
        db: Arc<Db>,
        changed_account_ids: &[AccountId],
        block_num: BlockNumber,
    ) -> Result<..> {
        ...
    }
}
```
Part 2/3 of #1185; lays the groundwork for #1354.
Intent
Avoid ser/de of the full `AccountInfo`.

core changes

- remove the `vault`/`storage_map` BLOB entries from the accounts table
- introduce `SmtForest` and integrate it into `apply_block` and `State::load`

significant changes in the following files:

- `crates/store/src/db/schema.rs` introduces `account_storage_headers` and removes `storage` (the full serialized account storage) from the `accounts` table
- `crates/store/src/state.rs` / `fn State::apply_block` now updates not only the database, but also the lookup table of roots for the `SmtForest` and the entries in the forest itself (indirect lookup tables)

out of scope
- `LargeSmtForest` for partial storage maps - [devops] Deploy without nuking db #670
- `GetAccountProof` endpoint #1354
- `(Large|)SmtForest` - Add cleanup routine for outdated historical storage entries #1355

how to review
Take the existing TODOs into consideration and check whether they make sense; they will be addressed in the follow-up PR.