-
Notifications
You must be signed in to change notification settings - Fork 29
feat(vcs): new data model #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: vcs-staging
Are you sure you want to change the base?
feat(vcs): new data model #192
Conversation
* Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the `github_...` tables to `vcs_...`, adding a new column to the relevant tables to identify which provider the records relate to, and more. * Added an Alembic migration, including moving the repository data from `oauthclient_remoteaccount` to the `vcs_repositories` table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs. * Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs. * This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful. * This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the `vcs-staging` branch and then merge them all into `master` once we have a fully release-ready prototype. At that point, we will create a squash commit.
9f1e07b to
449f41d
Compare
79ba5a6 to
fc8faf7
Compare
fc8faf7 to
66c42c0
Compare
zzacharo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@palkerecsenyi I dont see something worrying but it is also difficult without having the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint any major thing you see?
| "provider_id", | ||
| name="uq_vcs_repositories_provider_provider_id", | ||
| ), | ||
| # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I commented this because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models but I'm not super experienced with these so I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id, we should probably do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we have these indexes:
vcs_repositories:- Primary key on ID
- Unique index
uq_vcs_repositories_provider_provider_idon(provider, provider_id)since each provider must supply unique IDs for repos. When querying repos, we always query by provider and either provider ID or full name. And there are definitely a very high number of rows, so this makes sense for performance. Unique indexWe were hardly using this so I've removed it.uq_vcs_repositories_provider_nameon(provider, name).
vcs_releases- Primary key on ID
- Unique index
uq_vcs_releases_provider_id_provideron(provider, provider_id). We also query releases by provider and provider ID. - Non-unique index
ix_vcs_releases_record_idonrecord_idwhich already exists before the migration
vcs_repository_users- Primary key on both of the IDs
Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the For example the |
| existing_type=sa.Integer(), | ||
| existing_nullable=True, | ||
| ) | ||
| op.alter_column( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of this column and why we modify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stores the provider-specific ID of the webhook if the repository has been activated. On GitHub, this ID is always an integer so until now we have stored it as an integer. It also happens to be an integer on GitLab. But we don't know that this will be the case for all VCSes so we change it here to be stored as a string which is more flexible.
| op.add_column( | ||
| "vcs_repositories", | ||
| sa.Column( | ||
| "default_branch", sa.String(255), nullable=False, server_default="master" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this information used later on? why do we need to specify the default branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it to be able to generate a new file link for creating the CITATION.cff file, which is shown in the UI as a matter of convenience. It's not absolutely essential though.
invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py
Outdated
Show resolved
Hide resolved
invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py
Outdated
Show resolved
Hide resolved
| "vcs_repositories", sa.Column("license_spdx", sa.String(255), nullable=True) | ||
| ) | ||
| op.alter_column("vcs_repositories", "user_id", new_column_name="enabled_by_id") | ||
| op.drop_index("ix_github_repositories_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it OK to drop these indices? especially the id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have two indexes:
ix_github_repositories_nameon the repo nameix_github_repositories_github_idon the repo's provider (GitHub) ID
We are replacing them with these two:
uq_vcs_repositories_provider_provider_idon the combination ofproviderandprovider_id, since each repository must have a unique ID within the context of a provider.uq_vcs_repositories_provider_namesince each repository must have a unique full name (e.g.inveniosoftware/invenio-github) within the context of a provider.
| op.alter_column( | ||
| "vcs_repositories", | ||
| "github_id", | ||
| new_column_name="provider_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is the id of the repository supplied by the specific provider?
if this is the case, my first thought was that provider_id means we assign an identifier to a provider (as in (github, 1), (gitlab, 2)... ) so the name of the column might not be descriptive enough to remove the ambiguity...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it's been mentioned before the provider and provider_id have confusing names, so it's probably worth changing them. Maybe id_from_provider instead of provider_id or something similar?
| # | ||
| # We need to recreate the SQLAlchemy models for `RemoteAccount` and `Repository` here but | ||
| # in a much more lightweight way. We cannot simply import the models because (a) they depend | ||
| # on the full Invenio app being initialised and all extensions available and (b) we need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand why we replicate oauth remote account, won't this recipe fail the moment we try to upgrade an existing instance? or this is not "really" creating the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just creating an SQLAlchemy model of the table so we can interact with it in a similar way to the rest of our codebase. It doesn't actually attempt to create or modify the table itself.
The alternative is using raw SQL to read/insert rows, which would be confusing for such a complex migration.
| op.create_table( | ||
| "vcs_repository_users", | ||
| sa.Column("repository_id", UUIDType(), primary_key=True), | ||
| sa.Column("user_id", sa.Integer(), primary_key=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it our user_id or user id from the VCS provider? I guess from vcs judging from the FK constraints, but can we be more explicit. Also from our previous experiences, having int type on ids can be very problematic, I would suggest another approach. What if some vcs provider has alphanumeric user ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's our ID in this case, this table is storing which Invenio users have access to which repo. The foreign key maps it to accounts_user.id. Hence also why it's an int.
I agree the naming is confusing, maybe accounts_user_id or something similar would work better?
…on for orphaned repos
| "provider_id", | ||
| name="uq_vcs_repositories_provider_provider_id", | ||
| ), | ||
| # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id, we should probably do it.
| # Relationships | ||
| # | ||
| users = db.relationship(User, secondary=repository_user_association) | ||
| enabled_by_user = db.relationship(User, foreign_keys=[enabled_by_user_id]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this record the last user who enabled, in case there are multiple disable/enable actions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it records the last user to enable/disable. Currently this means that only the enabled_by_user will have permissions to manage records created from a release, although they can of course manually customise this once the record has been created.
The code for setting this is here: https://github.com/inveniosoftware/invenio-github/pull/194/files#diff-86dd04ea2c2792e65579eca9ba11e1ddf4a3a5bf5aefb53cc7f796d5b02c7e16R386-R389
invenio_vcs/models.py
Outdated
| def add_user(self, user_id: int): | ||
| """Add permission for a user to access the repository.""" | ||
| user = User(id=user_id) | ||
| user = db.session.merge(user) | ||
| self.users.append(user) | ||
|
|
||
| def remove_user(self, user_id: int): | ||
| """Remove permission for a user to access the repository.""" | ||
| user = User(id=user_id) | ||
| user = db.session.merge(user) | ||
| self.users.remove(user) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand these methods, what happens with self.users?
To fetch an existing user, we normally do something like this:
with db.session.no_autoflush:
user = current_datastore.get_user(...)
The no_autoflush is needed because if by any chance the user obj is modified, it will be persisted in the DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh my intention here was to add the user to the many-to-many vcs_repository_users table (and vice versa to remove it), but without running a SELECT query to get the full user, as that's unnecessary here. However I now notice that the merge function runs the SELECT anyway, so I will try inserting directly to repository_user_association maybe. Or do you think we should just query the user anyway (using current_datastore) to keep the code a little simpler?
invenio_vcs/models.py
Outdated
| if provider_id: | ||
| repo = cls.query.filter( | ||
| Repository.provider_id == provider_id, Repository.provider == provider | ||
| ).one_or_none() | ||
| if not repo and full_name is not None: | ||
| repo = cls.query.filter( | ||
| Repository.full_name == full_name, Repository.provider == provider | ||
| ).one_or_none() | ||
|
|
||
| return repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if provider_id: | |
| repo = cls.query.filter( | |
| Repository.provider_id == provider_id, Repository.provider == provider | |
| ).one_or_none() | |
| if not repo and full_name is not None: | |
| repo = cls.query.filter( | |
| Repository.full_name == full_name, Repository.provider == provider | |
| ).one_or_none() | |
| return repo | |
| if provider_id: | |
| .... | |
| elif not repo and full_name is not None: | |
| ... | |
| else: | |
| raise .... ? | |
| return repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would definitely be neater but it also changes the logic a little bit. I was trying to avoid changing the logic too much from how it was:
invenio-github/invenio_github/models.py
Lines 180 to 183 in abedae5
| if github_id: | |
| repo = cls.query.filter(Repository.github_id == github_id).one_or_none() | |
| if not repo and name is not None: | |
| repo = cls.query.filter(Repository.name == name).one_or_none() |
But I am happy to refactor this and do some testing to make sure nothing breaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we were hardly even using the name-based filtering of repos, I have gone ahead and removed it entirely. This also means we can have one less index on vcs_repositories. Hope that's okay!
|
|
||
| def upgrade(): | ||
| """Upgrade database.""" | ||
| op.rename_table("github_repositories", "vcs_repositories") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: given the complexity of this migration, would it be maybe easier to create a new table, copy over the data, and delete the old one instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have considered this and I have recommended creating a new table as the manual migration method for very large instances as it makes it easier to split up the migration into gradual batches without a full DB lock. However, the code is arguably even more complicated (see the new docs in my PR, especially under 'Example script'), and the risk of accidentally losing data due to bugs in the script is slightly higher.
…r` and `remove_user`
These have been moved to a Jinja template
…ndex on vcs_repositories
179515f to
455a2c0
Compare
Closes #188
Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the
github_...tables tovcs_..., adding a new column to the relevant tables to identify which provider the records relate to, and more.Added an Alembic migration, including moving the repository data from
oauthclient_remoteaccountto thevcs_repositoriestable, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The difference between whether to use the automated migration or the manual one will be clarified in the docs.Edit: see here for the upgrade guide for large instances.
We can improve the performance of this migration when perf(models): change
extra_datatoJSONBinvenio-oauthclient#360 is merged (assuming users run the migration in that PR before this one). But that's not essential.Added a repo-user m-to-m mapping table. By not storing repos in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the
vcs-stagingbranch and then merge them all intomasteronce we have a fully release-ready prototype. At that point, we will create a squash commit.