didier-durand (Contributor) commented Oct 27, 2025

Hi,
This PR adds a GitHub workflow that validates, at least daily, that all links in the Airflow docs are correct.

The workflow checks the links with Lychee: https://github.com/lycheeverse/lychee

Specific aspects:

  1. The check runs directly against the published website, in case the build process creates or transforms links.
  2. It is scheduled to run at least once a day, so that dangling external links are detected even when there
    is no activity on this repo itself.
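
To illustrate why the rendered site matters for point 1, here is a minimal Python sketch of the first step any link checker performs: collecting link targets from a page's final HTML and resolving relative links against the page URL. It is an illustration only, not Lychee's implementation; the names `LinkCollector` and `extract_links` are hypothetical, and the sketch only reads `<a href>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) link targets from the <a href> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Keep only web links (skips mailto:, javascript:, ...) and avoid duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list[str]:
    """Return the de-duplicated absolute links found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real checker would then request each collected URL and report the failing ones. Running the extraction on the published pages rather than on the sources means that links created or rewritten by the build are the ones that get checked.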

When tested on 'https://airflow.apache.org/docs/' from my repo clone, it delivers this report:

[image: Lychee report]

Didier

potiuk (Member) commented Oct 27, 2025

  1. Nice.
  2. What does it deliver on Airflow? It would make no sense to add the check before we fix the links if they are broken, and I expect a few thousand of those at the very least across all the historical versions of the documentation. Also, given the size of the Airflow docs, just checking them will likely take hours: we have literally tens of millions of pages.
  3. We are not supposed to use actions that are not approved by the ASF, so we will need to submit the action, with its specific version, to the allow list: https://github.com/apache/infrastructure-actions?tab=readme-ov-file#management-of-organization-wide-github-actions-allow-list
  4. We are also not supposed to refer to actions via a version tag (following the security guidelines of both the ASF and GitHub); we should use a commit SHA. This also means that we should likely add the octopin pre-commit hook to verify this (as we do in the airflow repo) if we are to add more actions: https://github.com/apache/airflow/blob/main/dev/.pre-commit-config.yaml
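
As a rough illustration of point 4, the following Python sketch flags `uses:` references in a workflow file that are not pinned to a full 40-character commit SHA. It is a simplified, line-based check under stated assumptions, not a description of how octopin works and not a substitute for it; the function name `unpinned_actions` is hypothetical.

```python
import re

# A full Git commit SHA is 40 hexadecimal characters.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")
# Matches "uses: owner/repo@ref" (optionally preceded by "- "), ignoring trailing comments.
USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return the `uses:` references that are not pinned to a full commit SHA."""
    offenders = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        reference = match.group(1)
        # Local actions (./path) and docker:// images have no Git ref to pin.
        if reference.startswith(("./", "docker://")):
            continue
        # Everything after the first "@" is the ref (tag, branch, or SHA).
        _, _, ref = reference.partition("@")
        if not FULL_SHA.match(ref):
            offenders.append(reference)
    return offenders
```

Applied to a workflow's text, the function returns references such as `actions/checkout@v4`, while a reference ending in a 40-character SHA passes. A mutable tag can be moved to different code later, which is why the guidelines ask for a commit SHA.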

potiuk closed this Oct 27, 2025
potiuk reopened this Oct 27, 2025
didier-durand (Contributor Author) commented Oct 27, 2025

Hi,

  1. Sure, if the rules require the action to be validated by the ASF, please go ahead.
  2. My initial text was misleading: I changed it to "When tested on 'https://airflow.apache.org/docs/' from my repo clone, it delivers this report:"

So, as per the image above, Lychee currently checks 152 links on the Airflow doc site and finds no errors.

didier-durand changed the title from "[CI] Lychee: automated check of links" to "[CI] Lychee: automated check of links in doc" Oct 27, 2025
potiuk (Member) commented Oct 27, 2025

> So, as per the image above, Lychee currently checks 152 links on the Airflow doc site and finds no errors.

Note that the complete Airflow site also includes tens of millions of lines (and ~250K files) coming from all historical versions of Airflow and 90+ packages. They are not available when you build the site locally: on the actual site they are served from S3 buckets via CloudFront and are made available through .htaccess redirection. I believe the site check runs against the "actual" https://airflow.apache.org, so it is going to include all of that, as far as I understand how it works.

potiuk (Member) commented Oct 27, 2025

More information on how our docs are built (and on the architecture): https://github.com/apache/airflow/blob/main/docs/README.md

potiuk (Member) commented Oct 27, 2025

Ah, sorry: it's much more than 10M lines, but it's just 265K .html files. Still a lot. I am running a bulk update on those now, so I counted them :D

didier-durand (Contributor Author)

Hi, I'll extend my workflow code tomorrow to list exactly which links are checked by Lychee.

didier-durand (Contributor Author)

Hi, I changed my code to run a much deeper scan, but I now get lots of errors due to obsolete links, mostly in release notes (old JIRA links, etc.). So I'll close this for now and try to come up with a smarter approach that detects real issues only.
