@dicagno dicagno commented Nov 25, 2025

Please ensure your pull request adheres to the following guidelines:

  • make sure to link the related issues in this description
  • when merging / squashing, make sure the fixed issue references are visible in the commits, for easy compilation of release notes
  • If data sources for any opportunity have been updated or added, please update the wiki for that opportunity.

Related Issues

Thanks for contributing!

@hannessolo hannessolo left a comment

This probably works, but I'm wondering if we shouldn't just change the approach completely.

E.g. parse the HTML -> strip the header/footer completely -> convert the page content to Markdown -> run readability on each paragraph of the Markdown content.

I think that would be more reliable than what we're doing here. E.g. what if we have a `p` inside the header? There just seem to be many edge cases where this can still give weird results.
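A rough sketch of the pipeline suggested above, using only the Python standard library. This is not the code in this PR: the tag sets, the `BlockExtractor` class, and `words_per_sentence` (a stand-in for whatever readability metric the audit really uses) are all assumptions made for illustration, and the "Markdown" step is reduced to one plain-text block per paragraph-like element.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets, chosen for illustration only.
SKIP = {"header", "footer", "nav", "script", "style"}  # subtrees dropped entirely
BLOCK = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class BlockExtractor(HTMLParser):
    """Collects page text as a flat list of blocks, ignoring SKIP subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a header/footer/etc.
        self.parts = []      # text fragments of the block being built
        self.blocks = []     # finished blocks, in document order

    def flush(self):
        # Collapse whitespace and store the current block if it is non-empty.
        text = " ".join("".join(self.parts).split())
        if text:
            self.blocks.append(text)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_paragraphs(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block element
    return parser.blocks


def words_per_sentence(text):
    # Placeholder metric, NOT a real readability formula.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / max(1, len(sentences))


sample = (
    "<header><p>Site menu</p></header>"
    "<main><p>First paragraph. It has two sentences.</p>"
    "<ul><li>One item</li></ul></main>"
    "<footer><p>Legal text</p></footer>"
)
paragraphs = html_to_paragraphs(sample)
scores = [words_per_sentence(p) for p in paragraphs]
```

Because every block boundary flushes the text gathered so far, each piece of text lands in exactly one block, and the `p` inside the header never reaches the scoring step.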

Or what happens when we have an `li` inside a `p` tag here? Does it run twice for that content?
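To make the nesting question concrete, here is a small illustration of how double processing can happen. It is a sketch, not the PR's selector logic, which is not shown in this thread: `naive_texts` and `outermost_texts` are hypothetical helpers, and an XML parser is used only so that the nesting is kept exactly as written. An HTML5-compliant parser would close the open `p` when it reaches the `ul`, so whether the nesting survives depends on the parser in use.

```python
import xml.etree.ElementTree as ET

TARGETS = {"p", "li"}  # assumed set of elements whose text gets scored


def element_text(el):
    # Full text of an element, descendants included, whitespace collapsed.
    return " ".join("".join(el.itertext()).split())


def naive_texts(root):
    """Text of every p/li element, whether or not it is nested in another."""
    return [element_text(el) for el in root.iter() if el.tag in TARGETS]


def outermost_texts(root):
    """Text of p/li elements that are not inside another p/li."""
    out = []

    def walk(el):
        if el.tag in TARGETS:
            out.append(element_text(el))
            return  # do not descend: nested targets are already covered
        for child in el:
            walk(child)

    walk(root)
    return out


fragment = ET.fromstring(
    "<div><p>Intro text <ul><li>Nested item</li></ul></p>"
    "<p>Plain paragraph</p></div>"
)
naive = naive_texts(fragment)          # "Nested item" appears in two entries
outermost = outermost_texts(fragment)  # each piece of text appears once
```

Stopping at the outermost match is only one option. Scoring the innermost blocks instead, or flattening to Markdown first, would avoid the duplication as well.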

@github-actions

This PR will trigger a minor release when merged.

@hannessolo hannessolo left a comment

Given time pressure this is OK, but as I said, I think that long term a solution that extracts the text more generally, e.g. through Markdown conversion, would be more robust.

dicagno commented Nov 25, 2025

> Given time pressure this is OK, but as I said, I think that long term a solution that extracts the text more generally, e.g. through Markdown conversion, would be more robust.

Agreed, as discussed. I would add that using Markdown would also help preserve "rich text" elements and include them properly in the "auto fix" stage.

@dicagno dicagno closed this Nov 25, 2025
@dicagno dicagno reopened this Nov 25, 2025
dicagno commented Nov 26, 2025

moved to #1668

@dicagno dicagno closed this Nov 26, 2025