-
Notifications
You must be signed in to change notification settings - Fork 2.4k
legacy repository: prefer JSON API to HTML API #10672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideAdds JSON Simple API support to legacy repositories, preferring JSON over HTML when available, unifies root page handling across HTML and JSON, supports relative URLs and hashes in JSON responses, and updates fixtures/tests so legacy behaviour is exercised against both HTML and JSON variants. Updated class diagram for simple repository link sources and root pagesclassDiagram
class LinkSource {
<<abstract>>
+clean_link(url: str) str
+links_for_package(name: NormalizedName) list[Link]
+yanked(name: NormalizedName, version: Version) str|bool
+_link_cache() LinkCache
}
class SimpleJsonPage {
-_url: str
-content: dict[str, Any]
+SimpleJsonPage(url: str, content: dict[str, Any])
+_link_cache() LinkCache
}
class HTMLPage {
-_url: str
-_content: str
+HTMLPage(url: str, content: str)
+_link_cache() LinkCache
}
class SimpleRepositoryRootPage {
<<abstract>>
+search(query: str|list[str]) list[str]
+package_names list[str]
}
class SimpleRepositoryHTMLRootPage {
-_parsed: list[dict[str, str]]
+SimpleRepositoryHTMLRootPage(content: str|None)
+package_names list[str]
}
class SimpleRepositoryJsonRootPage {
-_content: dict[str, Any]
+SimpleRepositoryJsonRootPage(content: dict[str, Any])
+package_names list[str]
}
class HTTPRepository {
-_url: str
-session: Session
+_get_response(endpoint: str, headers: dict[str, str]|None) Response|None
+_get_prefer_json_header() dict[str, str]
+_get_page(name: NormalizedName) LinkSource
}
class LegacyRepository {
+root_page SimpleRepositoryRootPage
}
LinkSource <|-- SimpleJsonPage
LinkSource <|-- HTMLPage
SimpleRepositoryRootPage <|-- SimpleRepositoryHTMLRootPage
SimpleRepositoryRootPage <|-- SimpleRepositoryJsonRootPage
HTTPRepository <|-- LegacyRepository
HTTPRepository --> LinkSource
LegacyRepository --> SimpleRepositoryRootPage
LegacyRepository --> SimpleRepositoryHTMLRootPage
LegacyRepository --> SimpleRepositoryJsonRootPage
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 6 issues, and left some high level feedback:
- The checks for JSON responses in both
HTTPRepository._get_pageandLegacyRepository.root_pagerely on an exactContent-Type == "application/vnd.pypi.simple.v1+json"; consider parsing or normalizing the header (e.g., handling parameters like; charset=utf-8or case differences) to avoid false negatives with compliant servers.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The checks for JSON responses in both `HTTPRepository._get_page` and `LegacyRepository.root_page` rely on an exact `Content-Type == "application/vnd.pypi.simple.v1+json"`; consider parsing or normalizing the header (e.g., handling parameters like `; charset=utf-8` or case differences) to avoid false negatives with compliant servers.
## Individual Comments
### Comment 1
<location> `src/poetry/repositories/http_repository.py:444-447` </location>
<code_context>
+ )
if not response:
raise PackageNotFoundError(f"Package [{name}] not found.")
+ if (
+ response.headers.get("Content-Type")
+ == "application/vnd.pypi.simple.v1+json"
+ ):
+ return SimpleJsonPage(f"{self._url}/{name}/", response.json())
return HTMLPage(response.url, response.text)
</code_context>
<issue_to_address>
**issue (bug_risk):** Content-Type equality check is brittle for JSON simple API responses.
A strict equality check against `"application/vnd.pypi.simple.v1+json"` will miss valid JSON responses that include parameters (e.g. `"...+json; charset=utf-8"`), causing them to be misclassified as HTML and passed to `HTMLPage`. Please relax this check (e.g. by stripping parameters, using `startswith`, or similar) so JSON responses with content-type parameters are still handled correctly.
</issue_to_address>
### Comment 2
<location> `src/poetry/repositories/legacy_repository.py:145-138` </location>
<code_context>
+ )
if not response:
raise PackageNotFoundError(f"Package [{name}] not found.")
+ if (
+ response.headers.get("Content-Type")
+ == "application/vnd.pypi.simple.v1+json"
+ ):
+ return SimpleJsonPage(f"{self._url}/{name}/", response.json())
return HTMLPage(response.url, response.text)
</code_context>
<issue_to_address>
**issue (bug_risk):** Same brittle Content-Type equality check on the root page as in HTTP repository.
If the root endpoint returns a Content-Type like `application/vnd.pypi.simple.v1+json; charset=utf-8`, this strict equality check will fail and we’ll incorrectly treat JSON as HTML. Please use a more tolerant check (e.g. compare only the media type before `;` or use `startswith`) so common header variations are handled correctly.
</issue_to_address>
### Comment 3
<location> `src/poetry/repositories/http_repository.py:448` </location>
<code_context>
+ response.headers.get("Content-Type")
+ == "application/vnd.pypi.simple.v1+json"
+ ):
+ return SimpleJsonPage(f"{self._url}/{name}/", response.json())
return HTMLPage(response.url, response.text)
</code_context>
<issue_to_address>
**suggestion:** Use `response.url` instead of reconstructing the JSON page URL.
For HTML we already use `response.url`, which accounts for redirects and server-side URL normalization. Reconstructing the JSON URL from `self._url` and `name` can diverge from the effective URL (e.g., redirects, trailing slashes) and break relative link resolution in `SimpleJsonPage`. Using `response.url` here would align the JSON behavior with HTML and avoid those base-URL inconsistencies.
</issue_to_address>
### Comment 4
<location> `src/poetry/repositories/link_sources/base.py:124-133` </location>
<code_context>
- raise PackageNotFoundError(f"Package [{name}] not found.")
- return HTMLPage(response.url, response.text)
-
@cached_property
def root_page(self) -> SimpleRepositoryRootPage:
- if not (response := self._get_response("/")):
</code_context>
<issue_to_address>
**issue (bug_risk):** Abstract `package_names` raising `NotImplementedError` conflicts with its use as a fallback instance.
Because `LegacyRepository.root_page` can return `SimpleRepositoryRootPage()` when fetching `/` fails, any later `root_page.search(...)` call will now raise `NotImplementedError` instead of returning an empty result. If `search` on this fallback is meant to be supported, consider having the base `package_names` return an empty list (or returning a concrete HTML/JSON implementation with empty content) rather than raising.
</issue_to_address>
### Comment 5
<location> `tests/repositories/link_sources/test_base.py:116-125` </location>
<code_context>
)
+
+
[email protected](
+ "query, expected",
+ [
+ ("poetry", ["poetry", "poetry-core"]),
+ (["requests", "urllib3"], ["requests", "urllib3"]),
+ ("lib", ["urllib3"]),
+ ],
+)
+def test_root_page_search(
+ root_page: SimpleRepositoryRootPage, query: str | list[str], expected: list[str]
+) -> None:
+ assert root_page.search(query) == expected
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case for `SimpleRepositoryRootPage.search` when there are no matches
Current parametrization covers several positive-match scenarios, but not the case where no package names match any token (e.g. `query="nonexistent"` or `query=["foo", "bar"]`). Please add one such case to document that `search` returns an empty list and to guard against regressions in empty-result behavior.
</issue_to_address>
### Comment 6
<location> `tests/repositories/fixtures/legacy.py:170-171` </location>
<code_context>
+ return LegacyRepository("legacy", legacy_repository_url, disable_cache=True)
+
+
[email protected](params=["legacy_repository_html", "legacy_repository_json"])
+def legacy_repository(request: FixtureRequest) -> LegacyRepository:
+ return request.getfixturevalue(request.param) # type: ignore[no-any-return]
+
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding focused tests for JSON preference and fallback behavior in HTTP/legacy repositories
The new `legacy_repository_html` / `legacy_repository_json` fixtures and the parametrized `legacy_repository` fixture nicely cover both variants across the suite, but the new behavior in `_get_page` / `root_page` is only tested indirectly. Please add small, targeted tests that:
- verify `HTTPRepository._get_page` sends the JSON-preferring `Accept` header and returns `SimpleJsonPage` when `Content-Type: application/vnd.pypi.simple.v1+json`
- verify it still returns `HTMLPage` for HTML-only responses
- verify `LegacyRepository.root_page` instantiates `SimpleRepositoryJsonRootPage` vs `SimpleRepositoryHTMLRootPage` based on response headers
This will make the JSON preference contract explicit and guard against regressions in header and content-type handling.
Suggested implementation:
```python
import pytest
from pip._internal.repositories import (
HTTPRepository,
LegacyRepository,
SimpleRepositoryHTMLRootPage,
SimpleRepositoryJsonRootPage,
)
from pip._internal.repositories.page import HTMLPage, SimpleJsonPage
def _install_fake_get_on_repo(
repo: HTTPRepository,
monkeypatch: pytest.MonkeyPatch,
*,
content_type: str,
body: str,
) -> dict[str, str]:
"""Helper to intercept headers passed by HTTPRepository._get_page.
It monkeypatches the underlying HTTP session's `get` method so we can:
- control the response headers/body
- capture the outgoing request headers
"""
captured_headers: dict[str, str] = {}
class FakeResponse:
def __init__(self, url: str) -> None:
self.url = url
self.status_code = 200
self.reason = "OK"
self.headers = {"Content-Type": content_type}
self._body = body.encode("utf-8")
self.encoding = "utf-8"
@property
def content(self) -> bytes: # HTMLPage/SimpleJsonPage usually read .content
return self._body
@property
def text(self) -> str:
return body
def fake_get(url: str, *, headers: dict[str, str] | None = None, **kwargs: object) -> FakeResponse:
if headers is not None:
captured_headers.update(headers)
return FakeResponse(url)
# HTTPRepository keeps a requests.Session-like object on `_session`
monkeypatch.setattr(repo._session, "get", fake_get, raising=True)
return captured_headers
def test_http_repository_get_page_prefers_json(monkeypatch: pytest.MonkeyPatch) -> None:
repo = HTTPRepository("simple", "https://legacy.example.com/simple")
captured_headers = _install_fake_get_on_repo(
repo,
monkeypatch,
content_type="application/vnd.pypi.simple.v1+json; charset=utf-8",
body='{"meta": {}, "projects": []}',
)
page = repo._get_page("https://legacy.example.com/simple/")
# JSON-preference contract: Accept header prefers simple-v1+json but still allows HTML
accept = captured_headers.get("Accept", "")
assert "application/vnd.pypi.simple.v1+json" in accept
assert "text/html" in accept
# And the correct page wrapper is used for JSON responses
assert isinstance(page, SimpleJsonPage)
def test_http_repository_get_page_falls_back_to_html(monkeypatch: pytest.MonkeyPatch) -> None:
repo = HTTPRepository("simple", "https://legacy.example.com/simple")
captured_headers = _install_fake_get_on_repo(
repo,
monkeypatch,
content_type="text/html; charset=utf-8",
body="<html><body>simple index</body></html>",
)
page = repo._get_page("https://legacy.example.com/simple/")
# Even when we get HTML, the Accept header should still express JSON preference
accept = captured_headers.get("Accept", "")
assert "application/vnd.pypi.simple.v1+json" in accept
assert "text/html" in accept
# HTML-only responses should still yield HTMLPage
assert isinstance(page, HTMLPage)
def test_legacy_repository_root_page_selects_json_vs_html(monkeypatch: pytest.MonkeyPatch) -> None:
base_url = "https://legacy.example.com/simple"
legacy_repo = LegacyRepository("legacy", base_url, disable_cache=True)
# First, simulate a JSON-capable simple index
def fake_get_json(url: str, *, headers: dict[str, str] | None = None, **kwargs: object):
class FakeResponse:
def __init__(self) -> None:
self.url = url
self.status_code = 200
self.reason = "OK"
self.headers = {"Content-Type": "application/vnd.pypi.simple.v1+json"}
self._body = b'{"meta": {}, "projects": []}'
self.encoding = "utf-8"
@property
def content(self) -> bytes:
return self._body
@property
def text(self) -> str:
return self._body.decode(self.encoding)
return FakeResponse()
# Then, simulate an HTML-only simple index
def fake_get_html(url: str, *, headers: dict[str, str] | None = None, **kwargs: object):
class FakeResponse:
def __init__(self) -> None:
self.url = url
self.status_code = 200
self.reason = "OK"
self.headers = {"Content-Type": "text/html; charset=utf-8"}
self._body = b"<html><body>simple index</body></html>"
self.encoding = "utf-8"
@property
def content(self) -> bytes:
return self._body
@property
def text(self) -> str:
return self._body.decode(self.encoding)
return FakeResponse()
# LegacyRepository internally wraps an HTTPRepository; patch its session in turn.
# First assertion: JSON response -> SimpleRepositoryJsonRootPage
http_repo = legacy_repo._simple_repository # type: ignore[attr-defined]
monkeypatch.setattr(http_repo._session, "get", fake_get_json, raising=True) # type: ignore[attr-defined]
json_root_page = legacy_repo.root_page()
assert isinstance(json_root_page, SimpleRepositoryJsonRootPage)
# Second assertion: HTML response -> SimpleRepositoryHTMLRootPage
monkeypatch.setattr(http_repo._session, "get", fake_get_html, raising=True) # type: ignore[attr-defined]
html_root_page = legacy_repo.root_page()
assert isinstance(html_root_page, SimpleRepositoryHTMLRootPage)
```
These edits assume:
1. The repository and page classes are imported from `pip._internal.repositories` and `pip._internal.repositories.page`. If your project uses different modules (e.g. `pip._internal.index.collector` or `pip._internal.network.simple`), adjust the import paths accordingly.
2. `HTTPRepository` exposes a `._session` attribute compatible with `requests.Session` and `LegacyRepository` exposes an internal `._simple_repository` with its own `._session`. If those internal attributes use different names, update the `monkeypatch.setattr(...)` targets to match.
3. `SimpleJsonPage` and `HTMLPage` are the concrete page types produced by `HTTPRepository._get_page`, and `SimpleRepositoryJsonRootPage` / `SimpleRepositoryHTMLRootPage` are returned by `LegacyRepository.root_page`. If your types differ, adjust the `isinstance` checks.
If there is already a shared helper for building fake HTTP responses or a common pattern for testing repositories in a dedicated test module (e.g. `tests/repositories/test_http_repository.py`), you may want to move these tests there and reuse the existing helpers instead of keeping them in the fixtures module.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
dbaf37d to
278eb37
Compare
Motivation: some information (e.g. size and upload-time) that may be required in future for features like pylock.toml and minimumReleaseAge/exclude-newer is only available via JSON Changes: * prefer JSON in legacy repositories and fallback to HTML if JSON is not supported * add support for JSON root pages * add support for relative URLs in JSON pages * add support for hashes in JSON pages * extend legacy tests so that they are run with the HTML variant and the JSON variant * harmonize HTML and JSON fixtures so that we get the same results in the tests
278eb37 to
7ec0b2a
Compare
Motivation
Some information (e.g.
sizeandupload-time) is only available via JSON.sizeandupload-timeare optional inpylock.toml, butsizeis at least a "should".upload-timeis also useful for features like #10646 / https://github.com/orgs/python-poetry/discussions/10555Changes
Pull Request Check List
Related-to: python-poetry/poetry-plugin-export#336
Related-to: #10356
Related-to: #10646
Summary by Sourcery
Prefer JSON Simple API for legacy repositories while keeping HTML as a fallback, and introduce a shared root-page abstraction used by both HTML and JSON link sources.
New Features:
Enhancements:
Tests: