feat(vcs): new data model #192

palkerecsenyi · 2025-09-25T10:10:42Z

Closes #188

Updated the data model to accommodate the new generic approach to VCS integration. This involves renaming the github_... tables to vcs_..., adding a new column to the relevant tables to identify which provider the records relate to, and more.
Added an Alembic migration, including moving the repository data from oauthclient_remoteaccount to the vcs_repositories table, which is a complex and long-running operation. This will be supplemented by a manual migration guide for instances like Zenodo where a several-minute full DB lock is not feasible. The docs will clarify when to use the automated migration and when to use the manual one.
- Edit: see here for the upgrade guide for large instances.
- We can improve the performance of this migration when perf(models): change extra_data to JSONB invenio-oauthclient#360 is merged (assuming users run the migration in that PR before this one). But that's not essential.
Added a repo-user many-to-many mapping table. Because repos are no longer stored in the Remote Accounts table, we need a different way of associating users with the repos they have access to. This table is synced using code that will be included in other PRs.
This PR contains only the data model changes themselves and not the associated functional changes needed to do anything useful.
This commit on its own is UNRELEASABLE. We will merge multiple commits related to the VCS upgrade into the vcs-staging branch and then merge them all into master once we have a fully release-ready prototype. At that point, we will create a squash commit.
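To make the data move described above more concrete, here is a minimal sketch of the idea, not the actual Alembic migration. It assumes that each remote account's extra_data holds a JSON object with a "repos" mapping keyed by the provider's repo ID (an assumed shape), and it uses SQLite through the standard-library sqlite3 module with integer repository IDs purely to stay self-contained. The function name migrate_repositories is hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE oauthclient_remoteaccount (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        extra_data TEXT NOT NULL            -- JSON blob, shape is an assumption
    );
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,             -- simplified; not the real ID type
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        full_name TEXT NOT NULL,
        UNIQUE (provider, provider_id)
    );
    CREATE TABLE vcs_repository_users (
        repository_id INTEGER NOT NULL REFERENCES vcs_repositories (id),
        user_id INTEGER NOT NULL,
        PRIMARY KEY (repository_id, user_id)
    );
    """
)

# Two users; both have access to repo 101, only the first one to repo 102.
accounts = [
    (1, 10, {"repos": {"101": {"full_name": "org/a"}, "102": {"full_name": "org/b"}}}),
    (2, 11, {"repos": {"101": {"full_name": "org/a"}}}),
]
conn.executemany(
    "INSERT INTO oauthclient_remoteaccount VALUES (?, ?, ?)",
    [(acc_id, user_id, json.dumps(data)) for acc_id, user_id, data in accounts],
)


def migrate_repositories(conn, provider="github"):
    """Copy repos out of the JSON blobs and link each one to its users."""
    rows = conn.execute(
        "SELECT user_id, extra_data FROM oauthclient_remoteaccount"
    ).fetchall()
    for user_id, extra_data in rows:
        repos = json.loads(extra_data).get("repos", {})
        for provider_id, repo in repos.items():
            # A repo shared by several users is only inserted once.
            conn.execute(
                "INSERT OR IGNORE INTO vcs_repositories "
                "(provider, provider_id, full_name) VALUES (?, ?, ?)",
                (provider, str(provider_id), repo["full_name"]),
            )
            (repo_id,) = conn.execute(
                "SELECT id FROM vcs_repositories "
                "WHERE provider = ? AND provider_id = ?",
                (provider, str(provider_id)),
            ).fetchone()
            conn.execute(
                "INSERT OR IGNORE INTO vcs_repository_users VALUES (?, ?)",
                (repo_id, user_id),
            )
    conn.commit()


migrate_repositories(conn)
repo_count = conn.execute("SELECT COUNT(*) FROM vcs_repositories").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM vcs_repository_users").fetchone()[0]
```

The shared repo ends up as one row in vcs_repositories with two rows in vcs_repository_users, which is the association the new mapping table is meant to capture.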


zzacharo

@palkerecsenyi I don't see anything worrying, but it is also difficult to judge without the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint, do you see any major issues?

zzacharo · 2025-10-15T12:41:31Z

invenio_vcs/models.py

+            "provider_id",
+            name="uq_vcs_repositories_provider_provider_id",
+        ),
+        # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),


I think I commented this because I wasn't 100% sure about the indexes/uniques. I'm fairly certain I've arranged them correctly for the new models but I'm not super experienced with these so I'm not sure.

Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id, we should probably do it.

Right now we have these indexes:

vcs_repositories:

- Primary key on ID
- Unique index uq_vcs_repositories_provider_provider_id on (provider, provider_id) since each provider must supply unique IDs for repos. When querying repos, we always query by provider and either provider ID or full name. And there are definitely a very high number of rows, so this makes sense for performance.
- ~~Unique index uq_vcs_repositories_provider_name on (provider, name).~~ We were hardly using this so I've removed it.

vcs_releases:

- Primary key on ID
- Unique index uq_vcs_releases_provider_id_provider on (provider, provider_id). We also query releases by provider and provider ID.
- Non-unique index ix_vcs_releases_record_id on record_id, which already existed before the migration

vcs_repository_users:

- Primary key on both of the IDs
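As a rough illustration of why the composite unique constraint is probably enough without the commented-out extra Index, here is a small sketch. It uses SQLite via the standard-library sqlite3 module only to stay self-contained (the PR itself uses SQLAlchemy), and the helper name insert_repo is hypothetical. The same provider_id is accepted for different providers, rejected within one provider, and the index backing the constraint also serves lookups by (provider, provider_id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        full_name TEXT NOT NULL,
        CONSTRAINT uq_vcs_repositories_provider_provider_id
            UNIQUE (provider, provider_id)
    )
    """
)


def insert_repo(provider, provider_id, full_name):
    """Insert a repo; return False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO vcs_repositories (provider, provider_id, full_name) "
                "VALUES (?, ?, ?)",
                (provider, provider_id, full_name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_repo("github", "42", "org/repo")
other_provider = insert_repo("gitlab", "42", "org/repo")  # same ID, other provider
duplicate = insert_repo("github", "42", "org/renamed")    # same provider and ID

# Ask SQLite how it would run the typical lookup.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT full_name FROM vcs_repositories "
    "WHERE provider = ? AND provider_id = ?",
    ("github", "42"),
).fetchall()
plan = " ".join(row[-1] for row in plan_rows)
```

In this sketch the query plan refers to the index that backs the unique constraint, so a second non-unique index on the same two columns would be redundant. Other databases may plan differently, so this is worth checking on the production engine.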


palkerecsenyi · 2025-10-15T13:12:09Z

@palkerecsenyi I don't see anything worrying, but it is also difficult to judge without the tests and the functional usage of the model... We might need to revisit that across the next PRs. @slint, do you see any major issues?

Yes indeed it's quite an annoying way to review sadly. If it helps, you can see the non-fragmented diff of all the code on the master branch of my fork which is kept up-to-date with the fragmented PRs.

For example the models.py file: master...palkerecsenyi:invenio-vcs:master#diff-a232ee65b447a8d90fbac12501761c411764f3570061d1b18e3e8181668fcc39

kpsherva · 2025-10-22T09:45:04Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+        existing_type=sa.Integer(),
+        existing_nullable=True,
+    )
+    op.alter_column(


What is the purpose of this column, and why do we modify it?

This stores the provider-specific ID of the webhook if the repository has been activated. On GitHub, this ID is always an integer so until now we have stored it as an integer. It also happens to be an integer on GitLab. But we don't know that this will be the case for all VCSes so we change it here to be stored as a string which is more flexible.
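A tiny sketch of what the switch to a string column implies for calling code, assuming webhook IDs arrive as integers from some providers and as opaque strings from others. The helper name normalise_hook_id and the sample values are hypothetical.

```python
from typing import Optional, Union


def normalise_hook_id(raw: Union[int, str, None]) -> Optional[str]:
    """Return the provider's webhook ID as a string, or None if there is none."""
    if raw is None:
        return None
    return str(raw)


github_hook = normalise_hook_id(123456)      # integer ID, as on GitHub or GitLab
other_hook = normalise_hook_id("hook-9f2c")  # hypothetical alphanumeric ID
inactive = normalise_hook_id(None)           # repository not activated
```

Normalising at the boundary avoids comparing an integer from a provider payload against the string stored in the database, which would never match.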

kpsherva · 2025-10-22T09:46:37Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.add_column(
+        "vcs_repositories",
+        sa.Column(
+            "default_branch", sa.String(255), nullable=False, server_default="master"


how is this information used later on? why do we need to specify the default branch?

We need it to be able to generate a new file link for creating the CITATION.cff file, which is shown in the UI as a matter of convenience. It's not absolutely essential though.

https://github.com/palkerecsenyi/invenio-vcs/blob/b9c8884f99435c900234c1ebeb5abcb59c24b238/invenio_vcs/views/vcs.py#L135-L137
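For readers who don't want to follow the link, the idea can be sketched roughly as below. This is not the code from views/vcs.py. It assumes GitHub's "new file" URL pattern (/new/<branch> with a filename query parameter), and the function name citation_file_url is hypothetical. Other providers would need their own URL scheme.

```python
from urllib.parse import quote, urlencode


def citation_file_url(
    full_name: str, default_branch: str, base_url: str = "https://github.com"
) -> str:
    """Build a link that opens the provider's 'new file' page for CITATION.cff."""
    branch = quote(default_branch, safe="/")  # branch names may contain slashes
    query = urlencode({"filename": "CITATION.cff"})
    return f"{base_url}/{full_name}/new/{branch}?{query}"


url = citation_file_url("inveniosoftware/invenio-github", "master")
```

Without a stored default_branch, the link would have to guess between names such as master and main, which is why the column carries a server default but is still populated per repository.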


kpsherva · 2025-10-22T09:48:43Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+        "vcs_repositories", sa.Column("license_spdx", sa.String(255), nullable=True)
+    )
+    op.alter_column("vcs_repositories", "user_id", new_column_name="enabled_by_id")
+    op.drop_index("ix_github_repositories_name")


why is it OK to drop these indices? especially the id

We currently have two indexes:

ix_github_repositories_name on the repo name

ix_github_repositories_github_id on the repo's provider (GitHub) ID

We are replacing them with these two:

uq_vcs_repositories_provider_provider_id on the combination of provider and provider_id, since each repository must have a unique ID within the context of a provider.

uq_vcs_repositories_provider_name since each repository must have a unique full name (e.g. inveniosoftware/invenio-github) within the context of a provider.

kpsherva · 2025-10-22T09:52:24Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.alter_column(
+        "vcs_repositories",
+        "github_id",
+        new_column_name="provider_id",


I guess this is the id of the repository supplied by the specific provider?
if this is the case, my first thought was that provider_id means we assign an identifier to a provider (as in (github, 1), (gitlab, 2)... ) so the name of the column might not be descriptive enough to remove the ambiguity...

Yes, I think it's been mentioned before that provider and provider_id have confusing names, so it's probably worth changing them. Maybe id_from_provider instead of provider_id, or something similar?

kpsherva · 2025-10-22T09:54:23Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    #
+    # We need to recreate the SQLAlchemy models for `RemoteAccount` and `Repository` here but
+    # in a much more lightweight way. We cannot simply import the models because (a) they depend
+    # on the full Invenio app being initialised and all extensions available and (b) we need


I don't fully understand why we replicate the OAuth remote account table. Won't this recipe fail the moment we try to upgrade an existing instance? Or is this not "really" creating the table?

This is just creating an SQLAlchemy model of the table so we can interact with it in a similar way to the rest of our codebase. It doesn't actually attempt to create or modify the table itself.

The alternative is using raw SQL to read/insert rows, which would be confusing for such a complex migration.
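The distinction between describing a table and creating it can be shown with a short sketch. Note that this swaps in SQLAlchemy's lightweight sa.table() / sa.column() constructs rather than the declarative models used in the migration, and the column list is an assumption. An in-memory SQLite engine stands in for the real database.

```python
import sqlalchemy as sa

# Migration-local description of an existing table. This only builds Python
# objects for composing queries; it emits no DDL.
remote_account = sa.table(
    "oauthclient_remoteaccount",
    sa.column("id", sa.Integer),
    sa.column("user_id", sa.Integer),
    sa.column("extra_data", sa.JSON),
)

# Queries can be composed as usual (SQLAlchemy 1.4+ select() style).
stmt = sa.select(remote_account.c.user_id, remote_account.c.extra_data).where(
    remote_account.c.id == 1
)
sql = str(stmt)

# Declaring the table above did not create anything in the database.
engine = sa.create_engine("sqlite://")
tables_in_db = sa.inspect(engine).get_table_names()
```

Inside an Alembic migration, such a statement would typically be executed on the connection returned by op.get_bind(), so the table is only ever read from or written to, never redefined.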

kpsherva · 2025-10-22T11:40:30Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+    op.create_table(
+        "vcs_repository_users",
+        sa.Column("repository_id", UUIDType(), primary_key=True),
+        sa.Column("user_id", sa.Integer(), primary_key=True),


Is it our user_id or the user ID from the VCS provider? I guess it's from the VCS, judging from the FK constraints, but can we be more explicit? Also, from our previous experience, having an int type on IDs can be very problematic, so I would suggest another approach. What if some VCS provider has alphanumeric user IDs?

It's our ID in this case, this table is storing which Invenio users have access to which repo. The foreign key maps it to accounts_user.id. Hence also why it's an int.

I agree the naming is confusing, maybe accounts_user_id or something similar would work better?


ntarocco · 2025-10-22T16:08:01Z

invenio_vcs/models.py

+            "provider_id",
+            name="uq_vcs_repositories_provider_provider_id",
+        ),
+        # Index("ix_vcs_repositories_provider_provider_id", "provider", "provider_id"),


Hard to say without seeing the rest of the code. If there are a high number of rows and we filter by provider_id, we should probably do it.

ntarocco · 2025-10-22T16:09:11Z

invenio_vcs/models.py

+    # Relationships
+    #
+    users = db.relationship(User, secondary=repository_user_association)
+    enabled_by_user = db.relationship(User, foreign_keys=[enabled_by_user_id])


does this record the last user who enabled, in case there are multiple disable/enable actions?

Yes, it records the last user to enable/disable. Currently this means that only the enabled_by_user will have permissions to manage records created from a release, although they can of course manually customise this once the record has been created.

The code for setting this is here: https://github.com/inveniosoftware/invenio-github/pull/194/files#diff-86dd04ea2c2792e65579eca9ba11e1ddf4a3a5bf5aefb53cc7f796d5b02c7e16R386-R389

ntarocco · 2025-10-22T16:16:52Z

invenio_vcs/models.py

+    def add_user(self, user_id: int):
+        """Add permission for a user to access the repository."""
+        user = User(id=user_id)
+        user = db.session.merge(user)
+        self.users.append(user)
+
+    def remove_user(self, user_id: int):
+        """Remove permission for a user to access the repository."""
+        user = User(id=user_id)
+        user = db.session.merge(user)
+        self.users.remove(user)


I don't understand these methods, what happens with self.users?
To fetch an existing user, we normally do something like this:

    with db.session.no_autoflush:
        user = current_datastore.get_user(...)

The no_autoflush is needed because if by any chance the user obj is modified, it will be persisted in the DB.

Ahh my intention here was to add the user to the many-to-many vcs_repository_users table (and vice versa to remove it), but without running a SELECT query to get the full user, as that's unnecessary here. However I now notice that the merge function runs the SELECT anyway, so I will try inserting directly to repository_user_association maybe. Or do you think we should just query the user anyway (using current_datastore) to keep the code a little simpler?
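The "insert directly into the association table" option can be sketched as follows. This is an illustration only, using SQLite via the standard-library sqlite3 module and a text repository ID to stay self-contained. The function names mirror the methods under discussion, while list_user_ids is a hypothetical helper.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vcs_repository_users (
        repository_id TEXT NOT NULL,
        user_id INTEGER NOT NULL,       -- the Invenio user ID, not the provider's
        PRIMARY KEY (repository_id, user_id)
    )
    """
)


def add_user(repository_id: str, user_id: int) -> None:
    """Grant access by writing the link row; the user row is never loaded."""
    conn.execute(
        "INSERT OR IGNORE INTO vcs_repository_users (repository_id, user_id) "
        "VALUES (?, ?)",
        (repository_id, user_id),
    )


def remove_user(repository_id: str, user_id: int) -> None:
    """Revoke access by deleting the link row."""
    conn.execute(
        "DELETE FROM vcs_repository_users WHERE repository_id = ? AND user_id = ?",
        (repository_id, user_id),
    )


def list_user_ids(repository_id: str) -> List[int]:
    rows = conn.execute(
        "SELECT user_id FROM vcs_repository_users "
        "WHERE repository_id = ? ORDER BY user_id",
        (repository_id,),
    )
    return [row[0] for row in rows]


add_user("repo-1", 7)
add_user("repo-1", 7)  # idempotent thanks to the composite primary key
after_add = list_user_ids("repo-1")
remove_user("repo-1", 7)
after_remove = list_user_ids("repo-1")
```

In SQLAlchemy terms this roughly corresponds to executing an insert() or delete() on the association table instead of appending to the users relationship, which avoids the SELECT that merge() triggers.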

ntarocco · 2025-10-22T16:17:42Z

invenio_vcs/models.py

+        if provider_id:
+            repo = cls.query.filter(
+                Repository.provider_id == provider_id, Repository.provider == provider
+            ).one_or_none()
+        if not repo and full_name is not None:
+            repo = cls.query.filter(
+                Repository.full_name == full_name, Repository.provider == provider
+            ).one_or_none()
+
+        return repo


Suggested change

-        if provider_id:
-            repo = cls.query.filter(
-                Repository.provider_id == provider_id, Repository.provider == provider
-            ).one_or_none()
-        if not repo and full_name is not None:
-            repo = cls.query.filter(
-                Repository.full_name == full_name, Repository.provider == provider
-            ).one_or_none()
-
-        return repo
+        if provider_id:
+            ....
+        elif not repo and full_name is not None:
+            ...
+        else:
+            raise .... ?
+        return repo

This would definitely be neater but it also changes the logic a little bit. I was trying to avoid changing the logic too much from how it was:

invenio-github/invenio_github/models.py

Lines 180 to 183 in abedae5

        if github_id:
            repo = cls.query.filter(Repository.github_id == github_id).one_or_none()
        if not repo and name is not None:
            repo = cls.query.filter(Repository.name == name).one_or_none()

But I am happy to refactor this and do some testing to make sure nothing breaks.

Since we were hardly even using the name-based filtering of repos, I have gone ahead and removed it entirely. This also means we can have one less index on vcs_repositories. Hope that's okay!
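A simplified lookup of this kind might look like the sketch below. It is an in-memory stand-in rather than the actual Repository.get() method, and the class RepositoryStore is hypothetical. It also adopts the reviewer's suggestion of failing loudly when no identifier is given, which is an assumption about the final behaviour.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Repository:
    provider: str
    provider_id: str
    full_name: str


class RepositoryStore:
    """In-memory stand-in for the vcs_repositories table."""

    def __init__(self) -> None:
        # Keyed like the unique constraint: (provider, provider_id).
        self._by_key: Dict[Tuple[str, str], Repository] = {}

    def add(self, repo: Repository) -> None:
        self._by_key[(repo.provider, repo.provider_id)] = repo

    def get(self, provider: str, provider_id: str) -> Optional[Repository]:
        """Look up a repo by provider and provider-supplied ID only."""
        if not provider_id:
            raise ValueError("provider_id is required")
        return self._by_key.get((provider, str(provider_id)))


store = RepositoryStore()
store.add(Repository("github", "42", "inveniosoftware/invenio-github"))
found = store.get("github", "42")
missing = store.get("gitlab", "42")
```

With a single lookup path there is no fallback order to reason about, and only the (provider, provider_id) constraint is needed to back it.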

ntarocco · 2025-10-22T16:20:37Z

invenio_vcs/alembic/1754318294_switch_to_generic_git_services.py

+
+def upgrade():
+    """Upgrade database."""
+    op.rename_table("github_repositories", "vcs_repositories")


Question: given the complexity of this migration, would it be maybe easier to create a new table, copy over the data, and delete the old one instead?

I have considered this and I have recommended creating a new table as the manual migration method for very large instances as it makes it easier to split up the migration into gradual batches without a full DB lock. However, the code is arguably even more complicated (see the new docs in my PR, especially under 'Example script'), and the risk of accidentally losing data due to bugs in the script is slightly higher.
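To give a feel for the batched approach, here is a minimal sketch. It is not the example script from the docs. It uses SQLite via the standard-library sqlite3 module, simplified columns, and a hypothetical function copy_in_batches. Each batch is copied in its own short transaction, using the last copied ID as a cursor, so no single long lock is held.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE github_repositories (
        id INTEGER PRIMARY KEY,
        github_id INTEGER NOT NULL,
        name TEXT NOT NULL
    );
    CREATE TABLE vcs_repositories (
        id INTEGER PRIMARY KEY,
        provider TEXT NOT NULL,
        provider_id TEXT NOT NULL,
        full_name TEXT NOT NULL,
        UNIQUE (provider, provider_id)
    );
    """
)
with conn:
    conn.executemany(
        "INSERT INTO github_repositories (id, github_id, name) VALUES (?, ?, ?)",
        [(i, 1000 + i, f"org/repo-{i}") for i in range(1, 6)],
    )


def copy_in_batches(conn, batch_size=2, provider="github"):
    """Copy old rows into the new table, one short transaction per batch."""
    last_id = 0
    copied = 0
    while True:
        rows = conn.execute(
            "SELECT id, github_id, name FROM github_repositories "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        with conn:  # commit after every batch
            conn.executemany(
                "INSERT OR IGNORE INTO vcs_repositories "
                "(id, provider, provider_id, full_name) VALUES (?, ?, ?, ?)",
                [(r[0], provider, str(r[1]), r[2]) for r in rows],
            )
        last_id = rows[-1][0]  # keyset cursor: resume after the last copied row
        copied += len(rows)
    return copied


copied = copy_in_batches(conn)
new_count = conn.execute("SELECT COUNT(*) FROM vcs_repositories").fetchone()[0]
```

Because the insert ignores rows that already exist, an interrupted run can be restarted safely. The trade-off mentioned above still applies: rows written to the old table after their batch has passed must be handled separately.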


palkerecsenyi changed the title ~~WIP: feat(vcs): new data model~~ feat(vcs): new data model Sep 25, 2025

palkerecsenyi force-pushed the data-layer branch from 9f1e07b to 449f41d Compare September 25, 2025 10:11

palkerecsenyi mentioned this pull request Aug 15, 2025

Make invenio-github support other VCS providers #188


palkerecsenyi linked an issue Sep 25, 2025 that may be closed by this pull request

Make invenio-github support other VCS providers #188


palkerecsenyi force-pushed the data-layer branch from 79ba5a6 to fc8faf7 Compare October 9, 2025 09:10

chore: pydoc

66c42c0

palkerecsenyi force-pushed the data-layer branch from fc8faf7 to 66c42c0 Compare October 9, 2025 15:57

zzacharo reviewed Oct 15, 2025


WIP: models: JSONB for errors column

bf91a21

WIP: chore: license

24cfce3

kpsherva reviewed Oct 22, 2025


kpsherva reviewed Oct 22, 2025


kpsherva reviewed Oct 22, 2025


feat(models): rename enabled_by_id -> enabled_by_user_id, add migration for orphaned repos

f8a3d84

ntarocco reviewed Oct 22, 2025


palkerecsenyi added 7 commits October 23, 2025 09:40

fix(models): remove redundant index + improve performance of `add_user` and `remove_user`

1c52d2d

fix(models): remove html_url

e24790e

WIP: models: create upgrade script for 2-step data migration

98b86fe

WIP: models: add timestamps to vcs_repository_users

65ef06c

WIP: models: remove title/icon/color mappings for release status

8750be9

These have been moved to a Jinja template

WIP: models: simplify repository get() method, remove provider/name index on vcs_repositories

7ba04da

WIP: models: add list_users method to Repository

455a2c0

palkerecsenyi force-pushed the data-layer branch from 179515f to 455a2c0 Compare October 31, 2025 08:55
