Description
As a network operator, we deploy in 4U pods of 2 routers and 2 edge servers. Based on recent scaling, we realize we need to shift more computing from the central db to the pods. The central db has become a bottleneck due to the massive number of operational queries needed to run the network (see server/model). Our goal is to get to 500k concurrent users per edge server, or 1m concurrent users per pod.
To realize that goal, we will work on the following:
- Add a db read replica in every pod. `server.Db` is the current read-only query mode, and will route to the pod-local read replica. Additionally we will change the read-only status to allow temporary tables. This will likely mean adding a new `server.ReadTx`. All `server.Tx` that are read-only except for temp tables will be changed to `server.ReadTx` (see the sketch after this list).
- Add a change event system to both 1. coordinate changes between the main db and read replicas, and 2. allow code to listen for changes to key prefixes and respond (see the sketch after this list). All code that polls the db will move to change events.
- For cases where we consult a small "black list" state, move these to in-memory caches that use change events to stay current (see the sketch after this list). To name a few: api token revoke, valid contract tests, connection client revoke.
- We currently cannot safely clean up the `network_client` table because if we delete a `client_id` that will be used at some point in the future, the sdk currently does not recover from that case. We need to change the sdk (device_local) to create a new provider `client_id` if the current one no longer exists or has errors. This will likely involve a client check and resolution when provider mode is enabled (see the sketch after this list). Additionally, the base api in the sdk should move to not having a `client_id` associated (an admin jwt).
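A minimal sketch of the new `server.ReadTx`, written against the standard `database/sql` package for illustration; the real code will use the project's existing driver and pool types, and `mainPool`, `replicaPool`, and the callback signature are assumptions:

```go
package server

import (
	"context"
	"database/sql"
)

// Hypothetical pool handles; names are assumptions, not the existing code.
var (
	mainPool    *sql.DB // central writable db
	replicaPool *sql.DB // pod-local read replica
)

// ReadTx runs a read-only workload against the pod-local read replica.
// Per the plan above, the transaction is not marked READ ONLY so the
// callback can still create temporary tables.
func ReadTx(ctx context.Context, callback func(tx *sql.Tx) error) error {
	tx, err := replicaPool.BeginTx(ctx, &sql.TxOptions{
		Isolation: sql.LevelReadCommitted,
		// ReadOnly is left false so temp tables are allowed.
	})
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if err := callback(tx); err != nil {
		return err
	}
	// Commit releases locks and temp tables; no writes reach durable tables.
	return tx.Commit()
}

// Tx keeps routing read-write work to the central db as before.
func Tx(ctx context.Context, callback func(tx *sql.Tx) error) error {
	tx, err := mainPool.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()
	if err := callback(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```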
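For the change event system, a minimal sketch assuming a simple in-process dispatcher keyed by prefix. How events are actually produced (write path on the main db, replication apply on the replicas) is left out, and all names here are illustrative:

```go
package server

import (
	"strings"
	"sync"
)

// ChangeEvent describes a mutation to a key (for example a table/row prefix).
// The shape is an assumption for illustration; the real payload may differ.
type ChangeEvent struct {
	Key string
}

// ChangeDispatcher fans change events out to subscribers by key prefix.
type ChangeDispatcher struct {
	mutex       sync.Mutex
	subscribers map[string][]chan ChangeEvent
}

func NewChangeDispatcher() *ChangeDispatcher {
	return &ChangeDispatcher{
		subscribers: map[string][]chan ChangeEvent{},
	}
}

// Subscribe registers interest in all keys under prefix and returns a channel
// of events. Code that currently polls the db would block on this channel instead.
func (d *ChangeDispatcher) Subscribe(prefix string) chan ChangeEvent {
	events := make(chan ChangeEvent, 64)
	d.mutex.Lock()
	defer d.mutex.Unlock()
	d.subscribers[prefix] = append(d.subscribers[prefix], events)
	return events
}

// Publish delivers an event to every subscriber whose prefix matches the key.
func (d *ChangeDispatcher) Publish(event ChangeEvent) {
	d.mutex.Lock()
	defer d.mutex.Unlock()
	for prefix, channels := range d.subscribers {
		if strings.HasPrefix(event.Key, prefix) {
			for _, events := range channels {
				select {
				case events <- event:
				default:
					// Drop rather than block the publisher if a subscriber is slow.
				}
			}
		}
	}
}
```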
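Building on the dispatcher sketch above, a hedged example of one of the in-memory "black list" caches, using api token revoke as the example; `loadRevoked` and the `token_revoke/` prefix are made up for illustration:

```go
package server

import (
	"strings"
	"sync"
)

// RevokedTokenCache is an in-memory black list that stays current by
// consuming change events instead of querying the db on every request.
type RevokedTokenCache struct {
	mutex   sync.RWMutex
	revoked map[string]bool
}

func NewRevokedTokenCache(dispatcher *ChangeDispatcher, loadRevoked func() []string) *RevokedTokenCache {
	cache := &RevokedTokenCache{
		revoked: map[string]bool{},
	}
	// Initial fill from the pod-local read replica.
	for _, tokenId := range loadRevoked() {
		cache.revoked[tokenId] = true
	}
	// Keep the cache current from change events on the token revoke prefix.
	events := dispatcher.Subscribe("token_revoke/")
	go func() {
		for event := range events {
			cache.mutex.Lock()
			cache.revoked[strings.TrimPrefix(event.Key, "token_revoke/")] = true
			cache.mutex.Unlock()
		}
	}()
	return cache
}

// IsRevoked is a memory-only check on the hot path; no db query.
func (c *RevokedTokenCache) IsRevoked(tokenId string) bool {
	c.mutex.RLock()
	defer c.mutex.RUnlock()
	return c.revoked[tokenId]
}
```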
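For the sdk change, a rough sketch of the provider `client_id` check-and-resolution step when provider mode is enabled. `providerApi`, `CheckClient`, `CreateClient`, and the sentinel error are hypothetical names, not the existing sdk surface:

```go
package sdk

import (
	"context"
	"errors"
)

// Hypothetical api surface for illustration only.
type providerApi interface {
	CheckClient(ctx context.Context, clientId string) error // errors if the client no longer exists
	CreateClient(ctx context.Context) (string, error)       // issues a new provider client_id
}

var errClientNotFound = errors.New("client_id not found")

// resolveProviderClientId verifies the client_id stored in device_local and
// mints a new one if it has been cleaned up from network_client or is
// otherwise unusable.
func resolveProviderClientId(ctx context.Context, api providerApi, storedClientId string) (string, error) {
	if storedClientId != "" {
		err := api.CheckClient(ctx, storedClientId)
		if err == nil {
			// Existing client_id is still valid; keep using it.
			return storedClientId, nil
		}
		if !errors.Is(err, errClientNotFound) {
			// Transient error; keep the current client_id rather than churn.
			return storedClientId, err
		}
		// The client_id was deleted server-side, so fall through and recover.
	}
	newClientId, err := api.CreateClient(ctx)
	if err != nil {
		return "", err
	}
	// The caller persists newClientId in device_local storage.
	return newClientId, nil
}
```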
Based on current stats, these changes will relieve ~80% of total DB time from the central DB, and additionally eliminate ~50% of total DB time from the pod read replicas. Roughly: removing ~80% of the central DB load leaves about 1/5 of it (around 5x headroom), and halving the remaining work on the pod read replicas adds up to another 2x, which is where the 6-10x increase in db performance per pod comes from. So if we currently hit issues at 20k concurrent users, we should get to roughly 200k concurrent with these changes. We will need to address further optimizations towards our goal at that point.
We should be able to add a large number of pods like this, although usage of the pod mesh network will grow linearly with the number of pods (replication traffic).