Description
I started this issue thinking I knew what I wanted to do, but realised that what I was proposing wasn't really solving the problem we have (or might have in future).
If we devolve sysadmin for experiment creation to more than one person, it would be helpful if they had different credentials, to reduce the blast radius if one of them leaks a secret. Only the affected secret would need to be rolled over.
Spinning up separate instances to support separate credentials might not be that efficient, e.g. if someone is only creating a few experiments but nonetheless needs to be able to generate their own credentials for them.
So supporting multiple credentials would be useful. This is also fairly standard: having multiple well-known keys that are permitted to access a system.
Credential rollover for jump connections themselves is also a consideration - since a broken connection prevents administration, it is better to make a new connection before breaking the old one.
There are some challenges - our current jump server only supports one credential, so spinning up a second server is necessary. What if we still want to use the old server address? Then the new server is just temporary, and we use it to reconfigure the original connection.
An alternative would be to have a jump host on the experiment that knows about old and new credentials. It would be configured with new credentials to try first and, if it cannot connect, would fall back to making a connection with the old credentials. Theoretically, it could have a list of several credentials to try, so there is a concept of prioritisation or ordering of the list. Any credentials that have gone stale should be removed once the new connection is up, so that an attacker can't somehow reinstate an old server, DoS the new one, and gain access to the experiments when they fall back to a compromised set of credentials.
There is also the risk that modifying the service configuration to add the new credential breaks the configuration, e.g. with a typo, so the service can't even try the fallback. This could be avoided if there was a way to update and verify the configuration using tested code (e.g. by storing the credentials outside the env file that is used to start the service, so the service does not depend on the credentials being specified correctly).
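A rough sketch of the ordered fallback list, just to make the idea concrete. Everything here is hypothetical (`CredentialList`, `add_preferred`, `prune_below` and the `try_connect` callback are made-up names, not anything in the current jump client), and it assumes credentials are plain strings held outside the service's env file.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CredentialList:
    """Ordered credentials for a jump client; index 0 is tried first."""

    secrets: List[str] = field(default_factory=list)

    def add_preferred(self, secret: str) -> None:
        # Validate before touching the live list, so a typo is rejected
        # here instead of breaking the service and its fallback.
        if not secret or secret != secret.strip():
            raise ValueError("credential is empty or has stray whitespace")
        if secret in self.secrets:
            self.secrets.remove(secret)
        self.secrets.insert(0, secret)

    def connect(self, try_connect: Callable[[str], bool]) -> Optional[str]:
        """Try each credential in priority order; return the one that worked."""
        for secret in self.secrets:
            if try_connect(secret):
                return secret
        return None

    def prune_below(self, working: str) -> None:
        # Drop every credential with lower priority than the working one,
        # so a later fallback cannot reach a stale (possibly leaked) secret.
        idx = self.secrets.index(working)
        self.secrets = self.secrets[: idx + 1]
```

With `["new", "old"]` in the list, a connection that only succeeds with `"old"` leaves the list unchanged after pruning, whereas a connection that succeeds with `"new"` prunes the list down to `["new"]`.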
Possibly we could have a connect-to-verify option? Allow both connections but control at server level which one gets the traffic? But if we can't kill the first connection, waiting for it to die might not be adequate, and we might need to force it.
We want the experiment to appear with the same topic though, so we don't have to update our admin scripts, and two connections to the same topic on the same server as client host would create a clash. If they are namespaced, the admin scripts would have to understand how to refer to the new namespace, and that means they need to understand credential rollover, which should be outside their scope of interest.
This needs more thought .....
----- other draft thoughts I had.
Credential rollover is a task we need to support, in case of leaked secrets.
Since the jump connection is critical to our ability to administer experiments, we need to be able to ensure the new connection works before we close the old one, else we have to visit the experiment and manually fix it over a direct connection.
We need to have a known-good connection at all times. Making a new connection does not always work, so we have to create the new one while keeping the old one alive.
Currently the jump server supports a single secret. Spinning up a second server is necessary for a safe credential rollover, and involves a number of steps:
- start new temporary server with new address and credential
- log into experiment with old jump connection, and create new temporary connection to the new temporary jump server
- log into experiment with the new temporary jump connection, and reconfigure the old connection
- restart the permanent server with the new credential
- check the permanent connection works with the new credential
This requires the whole fleet to be set up with the new temporary connection before restarting the old server.
An alternative would be to support old and new credentials at the same time. But, we'd need to change the name of the connection (topic) to avoid a naming conflict.
Changing credentials without restart would avoid inconveniencing other experimental administrators who may be wanting to use the jump server (e.g. if we have multiple tenants).
a) restart jump server and add new credential, whilst supporting old credential (or use API to add new credential)
b) admin creates new temporary jump client that is set up with the new credential and new name (e.g. <id>-tmp)
c) admin checks new temporary connection is working
d) admin uses new temporary connection to stand down the old jump client
e) admin uses the new connection to create a new permanent jump client with the old name <id>
f) admin checks new permanent connection is working
g) admin uses new permanent connection to stand down the new temporary connection
h) once all credentials are rolled over, admin restarts jump server with just the new credential or adjusts list of accepted credentials via API
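Steps b) to g) for a single experiment could be scripted along these lines. This is only a sketch: `roll_over_client` and the three callbacks (`start_client`, `stop_client`, `is_up`) are hypothetical stand-ins for whatever the admin scripts actually do over the jump connection, and steps a) and h) on the server side are not covered.

```python
from typing import Callable, List


def roll_over_client(
    exp_id: str,
    new_secret: str,
    start_client: Callable[[str, str], None],  # (connection name, secret)
    stop_client: Callable[[str], None],        # (connection name)
    is_up: Callable[[str], bool],              # (connection name)
) -> List[str]:
    """Make-before-break rollover for one experiment; returns a step log."""
    tmp = f"{exp_id}-tmp"
    log: List[str] = []

    start_client(tmp, new_secret)              # b) temporary client, new name
    log.append(f"start {tmp}")
    if not is_up(tmp):                         # c) check before touching old
        stop_client(tmp)
        log.append(f"abort: {tmp} not up, old connection untouched")
        return log

    stop_client(exp_id)                        # d) stand down old client
    log.append(f"stop {exp_id}")
    start_client(exp_id, new_secret)           # e) permanent client, old name
    log.append(f"start {exp_id}")
    if not is_up(exp_id):                      # f) check permanent connection
        log.append(f"abort: {exp_id} not up, keeping {tmp} for recovery")
        return log

    stop_client(tmp)                           # g) stand down temporary client
    log.append(f"stop {tmp}")
    return log
```

The point of the two early returns is that there is always at least one working connection left: either the old one (if the temporary connection never came up) or the temporary one (if the new permanent connection failed).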
Supporting multiple secrets would allow an easier credential rollover process, because we could make our new connection before breaking the old one.
Currently the jump server supports a single secret, preventing an easy credential rollover process because we can't simultaneously access the same server with the old and new credentials.
- jump connections are critical to our ability to admin experiments
- we need to make a new jump connection before we break the old one, just in case it isn't done right
- spinning up a second server just to make a credential rollover doesn't seem super elegant
- we'll not be able to reduce the blast radius by having multiple tenants, so any leak exposes the whole experiment fleet to a rollover task
Since we do all our admin via jump, it is too risky to do a break-before-make change, because if the new connection does not come up, then we have to manually connect to the experiment to fix it. Therefore we need a make-before-break, i.e. to make a second connection before we drop the old one.
With our current system, we can spin up a second temporary server at a different address.
However, that's quite a bit of faff for a credential change.
Better to support multiple credentials so we can just use a single server.
We also need a way to change the credentials while running, so that we don't have to restart the server and break connections that are being used currently.
While we're at it, we may as well implement namespaces for connections to support multiple tenants, i.e. only allow connections to namespaces that are declared as accessible to that credential. This is related to credential roll-over, because having multiple credentials reduces our blast radius should a credential leak.
use because we need to spin up a second server in a different location to allow a make-before-break process.
A better option would be to support two credentials so we can make before break, which goes a bit like this:
Options considered
a) start separate temporary server rather than using the same server (avoids name conflict but requires spinning up new instance)
b) put a time limit on the credentials, e.g. restart to add the new credential, and retire the old credential after a given time (this doesn't really solve the problem: you'd need to be conservative about how long you waited before retiring the old credential, in case rollover scripts were slow to apply to experiments, so you'd probably just restart once the job was done to avoid leaving the old credential in action any longer than necessary)
c) allow live update of the credential that the server works with, so that rollover can happen without restarting the jump server, although this would require killing current connections already using the old credential, so would need additional testing - we have a suitable mechanism in relay for this already
d) a list of multiple acceptable credentials so we can support multiple tenants with a single jump server but not share credentials - in which case restarting jump server is a non-starter as if one sys admin leaks credentials it may require a restart just when another sys admin for another tenant is busy doing urgent admin work.
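To illustrate options c) and d) together, here is a minimal sketch of a set of accepted secrets that can be changed while the server is running. It uses a bare HMAC-SHA256 signature as a stand-in for token signature checking (a real implementation would presumably use a JWT library), and `AcceptedSecrets` and `sign` are hypothetical names. It does not cover killing connections that were made with a removed secret.

```python
import hashlib
import hmac
import threading


def sign(secret: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 signature, standing in for a signed token."""
    return hmac.new(secret, message, hashlib.sha256).digest()


class AcceptedSecrets:
    """Secrets the server currently accepts; safe to update while running."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._secrets = set()

    def add(self, secret: bytes) -> None:
        with self._lock:
            self._secrets.add(secret)

    def remove(self, secret: bytes) -> None:
        with self._lock:
            self._secrets.discard(secret)

    def verify(self, message: bytes, signature: bytes) -> bool:
        # Take a snapshot so verification does not hold the lock.
        with self._lock:
            candidates = list(self._secrets)
        # Accept the message if any currently accepted secret signed it.
        return any(
            hmac.compare_digest(sign(secret, message), signature)
            for secret in candidates
        )
```

During a rollover both the old and new secret would be in the set, and the old one is removed once every client has moved over, without a restart that would interrupt another tenant's admin work.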
Having a single namespace also prevents us from reducing the blast radius, because it requires all sys admins to use the same jump secret for all their experiments, whereas we might at some point have a separate instance for jump and share it between tenants, each using a different jump secret. Of course, we'd need a way of namespacing connections so that one tenant can't access a connection from another namespace - we can't use the secret itself to enforce this as that would complicate rollover.
Add "namespace" field to tokens
Add API to jump that can CRUD credentials. Each credential is added with a POST, including a list of namespace(s) that the credential is allowed to access. Behind the scenes, keep the opposite list of namespaces, with a list of acceptable credentials for each namespace.
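The data structure behind such an API might look something like this. It is a sketch under assumptions: credentials are referred to by a hypothetical `credential_id` string, the class and method names are invented, and the HTTP layer (POST etc.) is left out entirely.

```python
from typing import Dict, Iterable, Set


class CredentialStore:
    """Maps credentials to namespaces and keeps the reverse index in step."""

    def __init__(self) -> None:
        self._by_credential: Dict[str, Set[str]] = {}
        self._by_namespace: Dict[str, Set[str]] = {}

    def create(self, credential_id: str, namespaces: Iterable[str]) -> None:
        if credential_id in self._by_credential:
            raise KeyError(f"credential {credential_id!r} already exists")
        allowed = set(namespaces)
        self._by_credential[credential_id] = allowed
        for ns in allowed:
            self._by_namespace.setdefault(ns, set()).add(credential_id)

    def read(self, credential_id: str) -> Set[str]:
        return set(self._by_credential[credential_id])

    def delete(self, credential_id: str) -> None:
        for ns in self._by_credential.pop(credential_id):
            holders = self._by_namespace[ns]
            holders.discard(credential_id)
            if not holders:
                # Do not leave empty namespaces in the reverse index.
                del self._by_namespace[ns]

    def update(self, credential_id: str, namespaces: Iterable[str]) -> None:
        self.delete(credential_id)
        self.create(credential_id, namespaces)

    def allowed(self, namespace: str, credential_id: str) -> bool:
        """Reverse lookup: may this credential access this namespace?"""
        return credential_id in self._by_namespace.get(namespace, set())
```

Keeping the reverse index means the per-connection check is a single lookup by namespace, and rolling a credential over is a `create` of the new one followed by a `delete` of the old one, without touching other tenants' entries.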
When checking a JWT is valid, check the token is signed with a credential that is on the list for that