Skip to content

Conversation

@bnaecker
Copy link
Collaborator

  • Adds model types for Subnet Pools, Subnet Pool Members, links between Subnet Pools and Silos, and External Subnets.
  • Add queries for CRUD on all four, which enforce our invariants like no overlapping IP members or subnets, and do allocation for External Subnets from an explicitly-requested subnet, or automatically from a pool. Pools can be specified either by name / ID or an optional IP version. In the later cases, the Subnet Pool is chosen from the pools linked to the silo.
  • Adds some scaffolding useful for coming integration with the public API, such as authz types and checks and create-time parameter types.
  • Closes Database models and queries for external subnets #9578

bnaecker and others added 3 commits January 18, 2026 13:20
- Adds model types for Subnet Pools, Subnet Pool Members, links between
  Subnet Pools and Silos, and External Subnets.
- Add queries for CRUD on all four, which enforce our invariants like no
  overlapping IP members or subnets, and do allocation for External
  Subnets from an explicitly-requested subnet, or automatically from a
  pool. Pools can be specified either by name / ID or an optional IP
  version. In the later cases, the Subnet Pool is chosen from the pools
  linked to the silo.
- Adds some scaffolding useful for coming integration with the public
  API, such as authz types and checks and create-time parameter types.
- Closes #9578
@bnaecker
Copy link
Collaborator Author

Excepting a CI flake, this is ready for review. The bad news is this PR is pretty big. The good is that it's mostly tests. The SQL involved is pretty subtle, so I went out of the way test it as much as I could.

Comment on lines +1248 to +1254
impl oso::PolarClass for SubnetPoolList {
fn get_polar_class_builder() -> oso::ClassBuilder<Self> {
oso::Class::builder()
.with_equality_check()
.add_attribute_getter("fleet", |_: &SubnetPoolList| FLEET)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to add a comment here or anything, since this looks like it's probably really straightforward if you know some basics, but could you give me (or point me at) a quick tutorial on Oso sometime?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dave Pacheco would be the best person for detailed questions, but we have a debugging guide here and some general background in the block comment starting here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing these are some sort of schema update fragments? Is it like each DDL statement has to be in its own file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's right. We have some docs here, but the gist is that you write each individual step in the update in its own file, and we run them under in a transaction during an upgrade. We have a test that ensures running all the updates results in the same schema as the overall dbinit.sql file.

Comment on lines +679 to +681
let Error::InvalidRequest { message } = &err else {
panic!("expected a NotFound error, found {err:#?}");
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use assert_matches! instead of panic here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definitely. I like this formulation since it destructures the returned type too, but either would work.

Comment on lines +871 to +883
let _ = datastore
.unlink_subnet_pool_from_silo(
opctx,
&authz_pool,
&pool,
&authz_silo,
)
.await
.expect("able to unlink pool from silo");
let _ = datastore
.link_subnet_pool_to_silo(opctx, &new_authz_pool, &authz_silo, true)
.await
.expect("able to link pool to silo after unlinking other");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust style question: would it be (marginally) better to have tests like this return Result, so we could just write:

datastore.unlink_subnet_pool_from_silo(...).await?;

Or is it preferable to do the let _ = [...].expect() for a more explicit test failure message?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally do the latter, to get more contextual information about the failure and why we didn't expect it. But you can definitely write tests that return a Result directly. I believe the test harness will unwrap it in that case, but I'm not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to make sure we're using the expected indexes for these queries? Can we EXPLAIN ANALYZE or something with CRDB?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do that sometimes, yes. We have a trait for explaining a query and getting the results. The main way we enforce using indexes is that all queries are forbidden to result in a plan with a full table or index scan. (The exception here is that you can disable this for data migrations during a schema upgrade.) This doesn't depend on the size of the table. No full scans at any point, of anything.

I use that ExplainableAsync trait in this PR, in the can_explain_insert_external_subnet_from_explicit_ip_query test.

Comment on lines +297 to +299
/// This query is pretty complicated. First, we support creating a subnet in a
/// few different ways, such as by an explicit IP subnet, automatically from a
/// specific pool, or automatically from a pool linked to the current Silo.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to express all cases in one query, or is possible to split them up?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on what you mean, but there really are four separate queries generated by this function:

  • requesting by explicit subnet
  • requesting by explicit pool (name or ID)
  • requesting by default pool of specific version
  • requesting the default pool, of which there has to be only one.

If you mean, "can we split up all these CTEs into different queries", then...you could, at the cost of an explicit interactive transaction. We try to avoid that, but it's obviously a tradeoff. This query is big and complicated, and the contention we might run into could be worth a simpler query. I personally try very hard to avoid explicit transactions, but I'm probably overly-sensitive to that.

Copy link
Contributor

@FelixMcFelix FelixMcFelix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben, I have a few higher-level questions but otherwise this looks like it's handling some very tricky problems correctly. Thanks a lot for the robust testing here, particularly around gap sizing and allocation.

Comment on lines 295 to 307
/// Create a new Subnet Pool Member.
///
/// IP subnets must be unique across all Subnet Pool Members, in all pools.
/// Any request to create a member with an overlapping IP subnet will fail.
pub async fn create_subnet_pool_member(
&self,
opctx: &OpContext,
authz_pool: &authz::SubnetPool,
params: &params::SubnetPoolMemberCreate,
) -> CreateResult<SubnetPoolMember> {
opctx.authorize(authz::Action::CreateChild, authz_pool).await?;
let member = SubnetPoolMember::new(params, authz_pool.id())?;
insert_subnet_pool_member_query(&member)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to prevent overlap with any IP pools, otherwise we might end up with some very strange NAT table behaviour at the switch. That ask is probably more relevant to the mechanics of insert_subnet_pool_member_query.

From the outside I had thought that Pool-typed AddressLots would be the central point for resource allocation between subnet pools and IP pools. But some digging pointed me at #3448, so that wouldn't work without an almighty migration that is 100% not in scope (and probably would be wasted given the rework proposed in RFD533). Practically I think this means that insert_subnet_pool_member_query and queries::ip_pool::FilterOverlappingIpRanges need to account for both of these tables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof good catch. I will see about adding that to both of these queries. Depending on how bad that is, I might defer that to a follow-up PR. This one is already enormous.

authz_project: &authz::Project,
params: params::ExternalSubnetCreate,
) -> CreateResult<ExternalSubnet> {
opctx.authorize(authz::Action::CreateChild, authz_project).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also require CreateChild permissions on the pool that the subnet is drawn from, in keeping with the EIP model?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice, but I don't think we can directly do that. We don't know which pool is being drawn from until we run the query, at least in the current formulation. The only case I'm aware of doing that in the EIP case is when allocating a service IP. Today at least, we know all the pools that could come from and so we can do that before running the allocation query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation there.

- Fix a couple of index and schema update fragment bugs
- Comments on some subtle / confusing database details
- Handle overflow at the very end of IP address space
- Remove `silo_id` from the `external_subnet` table.
- Various typo fixes
@bnaecker
Copy link
Collaborator Author

Thanks @FelixMcFelix and @mergeconflict for the thoughtful reviews. I've addressed almost all the comments. There are two outstanding items.

  • Check for overlap between IP Pools and Subnet Pools when inserting either. Add a test on both sides.
  • Make a new Nexus API version for the changed ExternalSubnetAllocator variants that disallow both a subnet and a pool, since the former fully-constrains the latter.

Sorry to say both of these will add even more the PR. At least its mostly generated code? 😬

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Database models and queries for external subnets

4 participants