Description
Right now, cargo sends a lot of data with each publish that is simply discarded, since the git-tracked index format has nowhere to store it.
For example, one big usability win for Estuary's frontend would be capturing the readme content, description, keywords, links to git repositories and documentation sites, and a timestamp for when the publish occurred.
In addition to storing this content, many databases offer extensions that can be leveraged for full-text search. This could be used to improve the features, storage, and overall performance of the search endpoint (#23).
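As a sketch of what that capture could look like with SQLite, the schema below stores the publish metadata described above and indexes the searchable columns with SQLite's bundled FTS5 full-text extension. All table and column names here are illustrative, not a proposed final design.

```sql
-- Hypothetical schema: metadata cargo sends at publish time
CREATE TABLE crates (
    name          TEXT NOT NULL,
    version       TEXT NOT NULL,
    description   TEXT,
    readme        TEXT,
    keywords      TEXT,            -- could also be a join table
    repository    TEXT,
    documentation TEXT,
    published_at  TEXT NOT NULL,   -- e.g. an RFC 3339 timestamp
    PRIMARY KEY (name, version)
);

-- FTS5 index over the columns the search endpoint would query
CREATE VIRTUAL TABLE crates_fts USING fts5(
    name, description, readme, keywords,
    content='crates'
);
```

Postgres has an analogous built-in facility (`tsvector`/`tsquery`), so the same idea carries over if an external database becomes the preferred backend.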
A loose design goal should be to offer a simple deployment option that doesn't require an external database (if possible), while planning for an external database as a secondary option. The rationale: embedded storage options like SQLite are not ideal for deployments involving shared disk, which can impact how you'd deploy Estuary (ex: taking steps to ensure more than one running instance isn't touching the database file concurrently).
At this point, I'd say we should pursue something along the lines of SQLite, with postgres as a follow-up. We should briefly search for a "better than SQLite" option first. I'm not sure whether sled offers disk persistence or is purely in-memory, as I believed it to be, for example. In JVM space, H2 is a thing; maybe Rust has a driver available. There might be others.
There might be other remote databases worth investigating, such as redis, but we can look at those later.
With this in mind, the modules in Estuary's source should be arranged so that storage backends can be selected via feature flags.
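To illustrate the feature-selection idea, here's a minimal sketch of a storage trait that backends could implement behind Cargo features. The trait, type, and feature names are all hypothetical, and the in-memory backend exists only to show the shape; real backends would live in their own feature-gated modules.

```rust
// Hypothetical sketch; names are illustrative, not Estuary's actual API.

/// Metadata cargo sends with a publish that the git index cannot hold.
pub struct PublishMeta {
    pub name: String,
    pub version: String,
    pub description: Option<String>,
    pub readme: Option<String>,
    pub keywords: Vec<String>,
}

/// Each storage backend implements this trait behind a Cargo feature.
pub trait MetadataStore {
    fn record_publish(&mut self, meta: PublishMeta);
    /// Returns the names of crates matching the query.
    fn search(&self, query: &str) -> Vec<String>;
}

/// Trivial in-memory backend, useful for tests.
pub struct MemoryStore {
    pub entries: Vec<PublishMeta>,
}

impl MetadataStore for MemoryStore {
    fn record_publish(&mut self, meta: PublishMeta) {
        self.entries.push(meta);
    }

    fn search(&self, query: &str) -> Vec<String> {
        // Naive substring match; a SQLite backend would use FTS instead.
        self.entries
            .iter()
            .filter(|m| {
                m.name.contains(query)
                    || m.description
                        .as_deref()
                        .map_or(false, |d| d.contains(query))
            })
            .map(|m| m.name.clone())
            .collect()
    }
}

// Feature-gated backend modules (hypothetical feature names):
#[cfg(feature = "sqlite")]
pub mod sqlite_backend { /* rusqlite-backed MetadataStore */ }

#[cfg(feature = "postgres")]
pub mod postgres_backend { /* tokio-postgres-backed MetadataStore */ }
```

With this arrangement, a `default = ["sqlite"]` feature keeps the simple no-external-database deployment, while `--no-default-features --features postgres` opts into the external-database build.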
Private registries for other languages, such as verdaccio (npm) and devpi (Python), appear to use "files on disk" database implementations with deployment concerns similar to SQLite's, so I imagine this is a decent place to start.