legacy repository: prefer JSON API to HTML API #10672

radoering · 2025-12-28T07:13:34Z

Motivation

Some information (e.g. size and upload-time) is only available via JSON. size and upload-time are optional in pylock.toml, but size is at least a "should". upload-time is also useful for features like #10646 / https://github.com/orgs/python-poetry/discussions/10555

Changes

prefer JSON in legacy repositories and fallback to HTML if JSON is not supported
add support for JSON root pages
add support for relative URLs in JSON pages
add support for hashes in JSON pages
extend legacy tests so that they are run with the HTML variant and the JSON variant
harmonize HTML and JSON fixtures so that we get the same results in the tests

Pull Request Check List

Related-to: python-poetry/poetry-plugin-export#336
Related-to: #10356
Related-to: #10646

Added tests for changed code.
Updated documentation for changed code.

Summary by Sourcery

Prefer JSON Simple API for legacy repositories while keeping HTML as a fallback, and introduce a shared root-page abstraction used by both HTML and JSON link sources.

New Features:

Support parsing and use of PEP 691 JSON Simple API pages for legacy repositories, including root project listings, hashes, and yanked metadata.
Allow JSON and HTML legacy repositories to be exercised interchangeably via parametrized test fixtures.

Enhancements:

Introduce a common SimpleRepositoryRootPage base class with unified search semantics shared by HTML and JSON root pages.
Normalize URL resolution for links in HTML and JSON pages, including handling of repository base URLs and hash fragments.
Extend legacy fixtures to better align HTML and JSON repositories and to expose additional distributions for search and yanking scenarios.

Tests:

Add unit tests for JSON link source behavior, including root page parsing, yanked handling, hashes, and relative URLs.
Update HTML link source tests for root page parsing, hash extraction from URLs, and base URL handling.
Parametrize legacy repository tests to run against both HTML and JSON variants where appropriate and adjust expectations accordingly.
Adapt solver, search command, and installation tests to work with the new legacy repository behavior and fixtures.

sourcery-ai · 2025-12-28T07:13:41Z

Reviewer's Guide

Adds JSON Simple API support to legacy repositories, preferring JSON over HTML when available, unifies root page handling across HTML and JSON, supports relative URLs and hashes in JSON responses, and updates fixtures/tests so legacy behaviour is exercised against both HTML and JSON variants.

Updated class diagram for simple repository link sources and root pages

classDiagram
    class LinkSource {
        <<abstract>>
        +clean_link(url: str) str
        +links_for_package(name: NormalizedName) list[Link]
        +yanked(name: NormalizedName, version: Version) str|bool
        +_link_cache() LinkCache
    }

    class SimpleJsonPage {
        -_url: str
        -content: dict[str, Any]
        +SimpleJsonPage(url: str, content: dict[str, Any])
        +_link_cache() LinkCache
    }

    class HTMLPage {
        -_url: str
        -_content: str
        +HTMLPage(url: str, content: str)
        +_link_cache() LinkCache
    }

    class SimpleRepositoryRootPage {
        <<abstract>>
        +search(query: str|list[str]) list[str]
        +package_names list[str]
    }

    class SimpleRepositoryHTMLRootPage {
        -_parsed: list[dict[str, str]]
        +SimpleRepositoryHTMLRootPage(content: str|None)
        +package_names list[str]
    }

    class SimpleRepositoryJsonRootPage {
        -_content: dict[str, Any]
        +SimpleRepositoryJsonRootPage(content: dict[str, Any])
        +package_names list[str]
    }

    class HTTPRepository {
        -_url: str
        -session: Session
        +_get_response(endpoint: str, headers: dict[str, str]|None) Response|None
        +_get_prefer_json_header() dict[str, str]
        +_get_page(name: NormalizedName) LinkSource
    }

    class LegacyRepository {
        +root_page SimpleRepositoryRootPage
    }

    LinkSource <|-- SimpleJsonPage
    LinkSource <|-- HTMLPage

    SimpleRepositoryRootPage <|-- SimpleRepositoryHTMLRootPage
    SimpleRepositoryRootPage <|-- SimpleRepositoryJsonRootPage

    HTTPRepository <|-- LegacyRepository

    HTTPRepository --> LinkSource
    LegacyRepository --> SimpleRepositoryRootPage
    LegacyRepository --> SimpleRepositoryHTMLRootPage
    LegacyRepository --> SimpleRepositoryJsonRootPage

File-Level Changes

Change	Details	Files
Prefer JSON Simple API over HTML for package pages and root index when available, including link hashing and relative URL handling.	Extended HTTPRepository._get_response to accept optional headers and added a helper to request JSON with appropriate Accept values. Updated HTTPRepository._get_page to request JSON first and return a SimpleJsonPage if the response Content-Type is the PEP 691 JSON media type, otherwise fall back to HTMLPage. Enhanced SimpleJsonPage to resolve relative file URLs against the repository URL, pass through hashes, and construct Link objects with these hashes and metadata.	`src/poetry/repositories/http_repository.py` `src/poetry/repositories/link_sources/json.py` `tests/repositories/link_sources/test_json.py`
Introduce a shared SimpleRepositoryRootPage abstraction and separate HTML vs JSON root-page implementations used by legacy repositories and search.	Introduced SimpleRepositoryRootPage with a generic search implementation over a package_names property. Refactored the existing HTML root page into SimpleRepositoryHTMLRootPage that now only parses anchors and exposes package_names. Added SimpleRepositoryJsonRootPage that reads project names from PEP 691 JSON root responses and returns them via package_names. Adapted LegacyRepository.root_page to prefer JSON (SimpleRepositoryJsonRootPage) using Accept headers, falling back to SimpleRepositoryHTMLRootPage. Updated base/root page tests to cover search behaviour and HTML/JSON concrete implementations.	`src/poetry/repositories/link_sources/base.py` `src/poetry/repositories/link_sources/html.py` `src/poetry/repositories/link_sources/json.py` `src/poetry/repositories/legacy_repository.py` `tests/repositories/link_sources/test_base.py` `tests/repositories/link_sources/test_html.py` `tests/repositories/link_sources/test_json.py`
Extend legacy fixtures and tests to run against both HTML and JSON variants, harmonizing fixture data and adjusting expectations.	Added JSON fixtures for legacy packages and a fixture set that can serve either HTML or JSON responses depending on parametrization. Introduced separate legacy_repository_html and legacy_repository_json fixtures and a parametrized legacy_repository that iterates over both for most tests. Adjusted specialized legacy repository mocking to work with the generic LinkSource instead of HTMLPage and updated tests that need HTML-only behaviour (solver, pool, partial-yank) to explicitly depend on legacy_repository_html. Updated search and yanking tests to account for additional packages/links available in fixtures and for unified yanked handling across variants. Extended HTML fixtures (e.g., ipython, black) to add more distributions and hash attributes so HTML and JSON behaviours align with the new expectations.	`tests/repositories/fixtures/legacy.py` `tests/repositories/test_legacy_repository.py` `tests/puzzle/test_solver.py` `tests/installation/conftest.py` `tests/console/commands/test_search.py` `tests/repositories/fixtures/legacy/.html` `tests/repositories/fixtures/legacy/json/.json`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 6 issues, and left some high level feedback:

The checks for JSON responses in both HTTPRepository._get_page and LegacyRepository.root_page rely on an exact Content-Type == "application/vnd.pypi.simple.v1+json"; consider parsing or normalizing the header (e.g., handling parameters like ; charset=utf-8 or case differences) to avoid false negatives with compliant servers.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The checks for JSON responses in both `HTTPRepository._get_page` and `LegacyRepository.root_page` rely on an exact `Content-Type == "application/vnd.pypi.simple.v1+json"`; consider parsing or normalizing the header (e.g., handling parameters like `; charset=utf-8` or case differences) to avoid false negatives with compliant servers.

## Individual Comments

### Comment 1
<location> `src/poetry/repositories/http_repository.py:444-447` </location>
<code_context>
+        )
         if not response:
             raise PackageNotFoundError(f"Package [{name}] not found.")
+        if (
+            response.headers.get("Content-Type")
+            == "application/vnd.pypi.simple.v1+json"
+        ):
+            return SimpleJsonPage(f"{self._url}/{name}/", response.json())
         return HTMLPage(response.url, response.text)
</code_context>

<issue_to_address>
**issue (bug_risk):** Content-Type equality check is brittle for JSON simple API responses.

A strict equality check against `"application/vnd.pypi.simple.v1+json"` will miss valid JSON responses that include parameters (e.g. `"...+json; charset=utf-8"`), causing them to be misclassified as HTML and passed to `HTMLPage`. Please relax this check (e.g. by stripping parameters, using `startswith`, or similar) so JSON responses with content-type parameters are still handled correctly.
</issue_to_address>

### Comment 2
<location> `src/poetry/repositories/legacy_repository.py:145-138` </location>
<code_context>
+        )
         if not response:
             raise PackageNotFoundError(f"Package [{name}] not found.")
+        if (
+            response.headers.get("Content-Type")
+            == "application/vnd.pypi.simple.v1+json"
+        ):
+            return SimpleJsonPage(f"{self._url}/{name}/", response.json())
         return HTMLPage(response.url, response.text)
</code_context>

<issue_to_address>
**issue (bug_risk):** Same brittle Content-Type equality check on the root page as in HTTP repository.

If the root endpoint returns a Content-Type like `application/vnd.pypi.simple.v1+json; charset=utf-8`, this strict equality check will fail and we’ll incorrectly treat JSON as HTML. Please use a more tolerant check (e.g. compare only the media type before `;` or use `startswith`) so common header variations are handled correctly.
</issue_to_address>

### Comment 3
<location> `src/poetry/repositories/http_repository.py:448` </location>
<code_context>
+            response.headers.get("Content-Type")
+            == "application/vnd.pypi.simple.v1+json"
+        ):
+            return SimpleJsonPage(f"{self._url}/{name}/", response.json())
         return HTMLPage(response.url, response.text)
</code_context>

<issue_to_address>
**suggestion:** Use `response.url` instead of reconstructing the JSON page URL.

For HTML we already use `response.url`, which accounts for redirects and server-side URL normalization. Reconstructing the JSON URL from `self._url` and `name` can diverge from the effective URL (e.g., redirects, trailing slashes) and break relative link resolution in `SimpleJsonPage`. Using `response.url` here would align the JSON behavior with HTML and avoid those base-URL inconsistencies.
</issue_to_address>

### Comment 4
<location> `src/poetry/repositories/link_sources/base.py:124-133` </location>
<code_context>
-            raise PackageNotFoundError(f"Package [{name}] not found.")
-        return HTMLPage(response.url, response.text)
-
     @cached_property
     def root_page(self) -> SimpleRepositoryRootPage:
-        if not (response := self._get_response("/")):
</code_context>

<issue_to_address>
**issue (bug_risk):** Abstract `package_names` raising `NotImplementedError` conflicts with its use as a fallback instance.

Because `LegacyRepository.root_page` can return `SimpleRepositoryRootPage()` when fetching `/` fails, any later `root_page.search(...)` call will now raise `NotImplementedError` instead of returning an empty result. If `search` on this fallback is meant to be supported, consider having the base `package_names` return an empty list (or returning a concrete HTML/JSON implementation with empty content) rather than raising.
</issue_to_address>

### Comment 5
<location> `tests/repositories/link_sources/test_base.py:116-125` </location>
<code_context>
     )
+
+
[email protected](
+    "query, expected",
+    [
+        ("poetry", ["poetry", "poetry-core"]),
+        (["requests", "urllib3"], ["requests", "urllib3"]),
+        ("lib", ["urllib3"]),
+    ],
+)
+def test_root_page_search(
+    root_page: SimpleRepositoryRootPage, query: str | list[str], expected: list[str]
+) -> None:
+    assert root_page.search(query) == expected
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test case for `SimpleRepositoryRootPage.search` when there are no matches

Current parametrization covers several positive-match scenarios, but not the case where no package names match any token (e.g. `query="nonexistent"` or `query=["foo", "bar"]`). Please add one such case to document that `search` returns an empty list and to guard against regressions in empty-result behavior.
</issue_to_address>

### Comment 6
<location> `tests/repositories/fixtures/legacy.py:170-171` </location>
<code_context>
+    return LegacyRepository("legacy", legacy_repository_url, disable_cache=True)
+
+
[email protected](params=["legacy_repository_html", "legacy_repository_json"])
+def legacy_repository(request: FixtureRequest) -> LegacyRepository:
+    return request.getfixturevalue(request.param)  # type: ignore[no-any-return]
+
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding focused tests for JSON preference and fallback behavior in HTTP/legacy repositories

The new `legacy_repository_html` / `legacy_repository_json` fixtures and the parametrized `legacy_repository` fixture nicely cover both variants across the suite, but the new behavior in `_get_page` / `root_page` is only tested indirectly. Please add small, targeted tests that:

- verify `HTTPRepository._get_page` sends the JSON-preferring `Accept` header and returns `SimpleJsonPage` when `Content-Type: application/vnd.pypi.simple.v1+json`
- verify it still returns `HTMLPage` for HTML-only responses
- verify `LegacyRepository.root_page` instantiates `SimpleRepositoryJsonRootPage` vs `SimpleRepositoryHTMLRootPage` based on response headers

This will make the JSON preference contract explicit and guard against regressions in header and content-type handling.

Suggested implementation:

```python
import pytest

from pip._internal.repositories import (
    HTTPRepository,
    LegacyRepository,
    SimpleRepositoryHTMLRootPage,
    SimpleRepositoryJsonRootPage,
)
from pip._internal.repositories.page import HTMLPage, SimpleJsonPage


def _install_fake_get_on_repo(
    repo: HTTPRepository,
    monkeypatch: pytest.MonkeyPatch,
    *,
    content_type: str,
    body: str,
) -> dict[str, str]:
    """Helper to intercept headers passed by HTTPRepository._get_page.

    It monkeypatches the underlying HTTP session's `get` method so we can:
    - control the response headers/body
    - capture the outgoing request headers
    """
    captured_headers: dict[str, str] = {}

    class FakeResponse:
        def __init__(self, url: str) -> None:
            self.url = url
            self.status_code = 200
            self.reason = "OK"
            self.headers = {"Content-Type": content_type}
            self._body = body.encode("utf-8")
            self.encoding = "utf-8"

        @property
        def content(self) -> bytes:  # HTMLPage/SimpleJsonPage usually read .content
            return self._body

        @property
        def text(self) -> str:
            return body

    def fake_get(url: str, *, headers: dict[str, str] | None = None, **kwargs: object) -> FakeResponse:
        if headers is not None:
            captured_headers.update(headers)
        return FakeResponse(url)

    # HTTPRepository keeps a requests.Session-like object on `_session`
    monkeypatch.setattr(repo._session, "get", fake_get, raising=True)

    return captured_headers


def test_http_repository_get_page_prefers_json(monkeypatch: pytest.MonkeyPatch) -> None:
    repo = HTTPRepository("simple", "https://legacy.example.com/simple")

    captured_headers = _install_fake_get_on_repo(
        repo,
        monkeypatch,
        content_type="application/vnd.pypi.simple.v1+json; charset=utf-8",
        body='{"meta": {}, "projects": []}',
    )

    page = repo._get_page("https://legacy.example.com/simple/")

    # JSON-preference contract: Accept header prefers simple-v1+json but still allows HTML
    accept = captured_headers.get("Accept", "")
    assert "application/vnd.pypi.simple.v1+json" in accept
    assert "text/html" in accept

    # And the correct page wrapper is used for JSON responses
    assert isinstance(page, SimpleJsonPage)


def test_http_repository_get_page_falls_back_to_html(monkeypatch: pytest.MonkeyPatch) -> None:
    repo = HTTPRepository("simple", "https://legacy.example.com/simple")

    captured_headers = _install_fake_get_on_repo(
        repo,
        monkeypatch,
        content_type="text/html; charset=utf-8",
        body="<html><body>simple index</body></html>",
    )

    page = repo._get_page("https://legacy.example.com/simple/")

    # Even when we get HTML, the Accept header should still express JSON preference
    accept = captured_headers.get("Accept", "")
    assert "application/vnd.pypi.simple.v1+json" in accept
    assert "text/html" in accept

    # HTML-only responses should still yield HTMLPage
    assert isinstance(page, HTMLPage)


def test_legacy_repository_root_page_selects_json_vs_html(monkeypatch: pytest.MonkeyPatch) -> None:
    base_url = "https://legacy.example.com/simple"
    legacy_repo = LegacyRepository("legacy", base_url, disable_cache=True)

    # First, simulate a JSON-capable simple index
    def fake_get_json(url: str, *, headers: dict[str, str] | None = None, **kwargs: object):
        class FakeResponse:
            def __init__(self) -> None:
                self.url = url
                self.status_code = 200
                self.reason = "OK"
                self.headers = {"Content-Type": "application/vnd.pypi.simple.v1+json"}
                self._body = b'{"meta": {}, "projects": []}'
                self.encoding = "utf-8"

            @property
            def content(self) -> bytes:
                return self._body

            @property
            def text(self) -> str:
                return self._body.decode(self.encoding)

        return FakeResponse()

    # Then, simulate an HTML-only simple index
    def fake_get_html(url: str, *, headers: dict[str, str] | None = None, **kwargs: object):
        class FakeResponse:
            def __init__(self) -> None:
                self.url = url
                self.status_code = 200
                self.reason = "OK"
                self.headers = {"Content-Type": "text/html; charset=utf-8"}
                self._body = b"<html><body>simple index</body></html>"
                self.encoding = "utf-8"

            @property
            def content(self) -> bytes:
                return self._body

            @property
            def text(self) -> str:
                return self._body.decode(self.encoding)

        return FakeResponse()

    # LegacyRepository internally wraps an HTTPRepository; patch its session in turn.
    # First assertion: JSON response -> SimpleRepositoryJsonRootPage
    http_repo = legacy_repo._simple_repository  # type: ignore[attr-defined]
    monkeypatch.setattr(http_repo._session, "get", fake_get_json, raising=True)  # type: ignore[attr-defined]
    json_root_page = legacy_repo.root_page()
    assert isinstance(json_root_page, SimpleRepositoryJsonRootPage)

    # Second assertion: HTML response -> SimpleRepositoryHTMLRootPage
    monkeypatch.setattr(http_repo._session, "get", fake_get_html, raising=True)  # type: ignore[attr-defined]
    html_root_page = legacy_repo.root_page()
    assert isinstance(html_root_page, SimpleRepositoryHTMLRootPage)

```

These edits assume:
1. The repository and page classes are imported from `pip._internal.repositories` and `pip._internal.repositories.page`. If your project uses different modules (e.g. `pip._internal.index.collector` or `pip._internal.network.simple`), adjust the import paths accordingly.
2. `HTTPRepository` exposes a `._session` attribute compatible with `requests.Session` and `LegacyRepository` exposes an internal `._simple_repository` with its own `._session`. If those internal attributes use different names, update the `monkeypatch.setattr(...)` targets to match.
3. `SimpleJsonPage` and `HTMLPage` are the concrete page types produced by `HTTPRepository._get_page`, and `SimpleRepositoryJsonRootPage` / `SimpleRepositoryHTMLRootPage` are returned by `LegacyRepository.root_page`. If your types differ, adjust the `isinstance` checks.

If there is already a shared helper for building fake HTTP responses or a common pattern for testing repositories in a dedicated test module (e.g. `tests/repositories/test_http_repository.py`), you may want to move these tests there and reuse the existing helpers instead of keeping them in the fixtures module.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

src/poetry/repositories/http_repository.py

src/poetry/repositories/legacy_repository.py

src/poetry/repositories/http_repository.py

src/poetry/repositories/link_sources/base.py

tests/repositories/link_sources/test_base.py

tests/repositories/fixtures/legacy.py

Motivation: some information (e.g. size and upload-time) that may be required in future for features like pylock.toml and minimumReleaseAge/exclude-newer is only available via JSON Changes: * prefer JSON in legacy repositories and fallback to HTML if JSON is not supported * add support for JSON root pages * add support for relative URLs in JSON pages * add support for hashes in JSON pages * extend legacy tests so that they are run with the HTML variant and the JSON variant * harmonize HTML and JSON fixtures so that we get the same results in the tests

sourcery-ai bot reviewed Dec 28, 2025

View reviewed changes

radoering force-pushed the legacy-json-api branch from dbaf37d to 278eb37 Compare December 28, 2025 09:57

radoering force-pushed the legacy-json-api branch from 278eb37 to 7ec0b2a Compare December 31, 2025 06:46

radoering merged commit 93baede into python-poetry:main Dec 31, 2025
54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

legacy repository: prefer JSON API to HTML API #10672

legacy repository: prefer JSON API to HTML API #10672

Uh oh!

radoering commented Dec 28, 2025 •

edited by sourcery-ai bot

Loading

Uh oh!

sourcery-ai bot commented Dec 28, 2025 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

legacy repository: prefer JSON API to HTML API #10672

legacy repository: prefer JSON API to HTML API #10672

Uh oh!

Conversation

radoering commented Dec 28, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Pull Request Check List

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Dec 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Updated class diagram for simple repository link sources and root pages

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

radoering commented Dec 28, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Dec 28, 2025 •

edited

Loading