⚡ Convert Websites into High-Quality Training Data for LLMs & AI-First SEO Build optimized llms.txt files and clean Markdown datasets from real website content — perfect for LLM pipelines, AI-ready sitemaps, and content-aware SEO.
- Navigation Extraction: Scrape navigation elements from websites using customizable CSS selectors to generate an llms.txt preview that you can edit and optimize before downloading
- Link Conversion: Transform links in various formats (Markdown, HTML, raw URLs) into a standardized format
- URL to Markdown: Process URLs to extract key content and convert it to Markdown format
- 🧠 Create better training datasets for custom LLMs and retrieval-augmented generation (RAG) systems
- 🔍 Make your content more accessible to generative search engines and AI assistants
- 🛠️ Quickly generate structured Markdown from live websites — no manual cleanup
- 📁 Design AI-first sitemaps (`llms.txt`) to guide crawlers and reduce hallucination
- SEO professionals building `llms.txt` files for AI search engines
- AI engineers preparing website-derived training datasets
- Product teams designing AI-friendly site structures
- Anyone exploring the intersection of SEO and LLMs
Try it here → llms-generator-toolkit.onrender.com
- Generate a semantic site map from any homepage navigation
- Convert links into AI-readable Markdown
- Extract clean Markdown from product/category/landing pages
Note: The demo is hosted on Render's free tier plan, so the navigation extraction feature may be slow to respond. This is due to free tier limitations rather than issues with the application itself. For faster performance, consider running the application locally.
- Built with Dash for reactive Python UI
- Uses Playwright to scrape dynamic websites
- HTML parsing via BeautifulSoup + semantic Markdown generation
- Runs on Render with Chromium sandboxing enabled
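To make the architecture concrete, here is a hedged sketch of how these pieces can fit together: render a page with Playwright, locate the navigation with BeautifulSoup, and emit Markdown links. It is a simplified illustration rather than the toolkit's actual code; the URL and selector are the same examples used in the usage section below.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def nav_to_markdown(url: str, nav_selector: str) -> str:
    """Render a page headlessly, then convert its navigation links to Markdown."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-rendered menus
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    nav = soup.select_one(nav_selector)
    if nav is None:
        return ""

    lines = []
    for a in nav.find_all("a", href=True):
        text = a.get_text(strip=True)
        if text:
            lines.append(f"- [{text}]({urljoin(url, a['href'])})")
    return "\n".join(lines)


# Example values taken from the usage section below
print(nav_to_markdown("https://www.lego.com/en-ie", 'nav[data-test="main-navigation"]'))
```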
- Python 3.8+ (Python 3.11+ recommended)
- `pip` package manager
- Playwright (installed automatically)
```bash
git clone https://github.com/jeredhiggins/llms-generator-toolkit.git
cd llms-generator-toolkit
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
python -m playwright install chromium
python app.py
```

Then open your browser at `http://127.0.0.1:8050`.
This project is configured to deploy easily on Render.
- Chromium is installed manually during the build without `--with-deps`
- The browser cache is stored at `/opt/render/.cache/ms-playwright`
- No system-level dependencies are required due to a clean headless setup

Your `render.yaml` handles all of this automatically:
```yaml
buildCommand: |
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  mkdir -p /opt/render/.cache/ms-playwright
  python -m playwright install --force chromium
```

The app is started using gunicorn:

```yaml
startCommand: .venv/bin/gunicorn --bind 0.0.0.0:$PORT --timeout 600 app:server
envVars:
  - key: PYTHONUNBUFFERED
    value: true
  - key: PLAYWRIGHT_BROWSERS_PATH
    value: /opt/render/.cache/ms-playwright
```
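The `app:server` target in the start command assumes `app.py` exposes the Dash app's underlying Flask instance at module level, which is the standard Dash pattern for gunicorn. A minimal sketch of that pattern (the layout here is a placeholder, not this project's UI):

```python
import dash
from dash import html

app = dash.Dash(__name__)
app.layout = html.Div("Hello")  # placeholder layout, not the real UI

# Expose the underlying Flask instance so gunicorn can serve `app:server`
server = app.server

if __name__ == "__main__":
    # Local development only; Render runs gunicorn instead
    app.run(debug=True, port=8050)
```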
Extract the navigation structure from a website:

- Enter the homepage URL (e.g., `https://www.lego.com/en-ie`)
- Specify the root navigation selector (e.g., `nav[data-test="main-navigation"]`)
- Optionally, provide a context selector to refine link extraction
- Click "Extract Navigation"
- View the Markdown-formatted navigation structure
- Further customize/edit the generated Markdown
- Download the result as `llms.txt`
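For orientation, the preview is a heading-and-link outline along these lines; this is purely illustrative for a hypothetical site, and the actual structure depends on the site's navigation and your selectors:

```markdown
# Example Store

## Main Navigation
- [Products](https://example.com/products)
- [Support](https://example.com/support)
- [About Us](https://example.com/about)
```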
Convert links from various formats to a standardized format:
- Paste Markdown links, HTML links, or raw URLs in the input area
- Click "Convert Links"
- The converted links will appear in standardized Markdown format
- Copy and use the converted links as needed
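As a rough illustration of the normalization (the exact output, especially the title chosen for bare URLs, may differ), mixed input like this:

```
<a href="https://example.com/pricing">Pricing</a>
https://example.com/blog
[Docs](https://example.com/docs)
```

would come out as standardized Markdown links along these lines:

```markdown
- [Pricing](https://example.com/pricing)
- [https://example.com/blog](https://example.com/blog)
- [Docs](https://example.com/docs)
```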
Process URLs to extract key content in Markdown format:
- Enter one or more URLs (one per line) in the input area
- Click "Convert URLs"
- View the extracted Markdown content for each URL
- Download the results as individual Markdown files or a ZIP archive
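Conceptually, this step resembles the following hedged sketch (not the toolkit's implementation, which uses Playwright for dynamic pages): fetch the page, strip non-content elements, and serialize the rest as Markdown. The `html2text` dependency and the tag list are illustrative assumptions.

```python
import requests
import html2text
from bs4 import BeautifulSoup


def url_to_markdown(url: str) -> str:
    """Fetch a page and return a rough Markdown rendering of its main content."""
    resp = requests.get(url, timeout=30, headers={"User-Agent": "ExampleBot/1.0"})
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop elements that rarely belong in training data
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    converter = html2text.HTML2Text()
    converter.ignore_images = True
    converter.body_width = 0  # don't hard-wrap lines
    return converter.handle(str(soup))


print(url_to_markdown("https://example.com"))
```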
The llms.txt file generated by this tool serves as a guide for AI-based crawlers and large language models, similar to how robots.txt guides traditional web crawlers. Implementing an llms.txt file offers several benefits:
- Enhanced AI Discoverability: Help AI systems better understand your site structure and important content
- Improved Content Representation: Ensure AI models accurately represent your website's content in search results and responses
- Training Data Optimization: Guide LLMs on which content to prioritize for training and responses
- Reduced Hallucination Risk: Provide structured, accurate data to reduce the risk of AI models generating incorrect information about your site
- Better Resource Allocation: Indicate which parts of your site are most relevant for AI training and responses
The tool allows you to customize the llms.txt output to optimize for specific AI models and search algorithms:
- Content Prioritization: Order your navigation links by importance to guide AI crawlers to your most valuable content first
- Content Categorization: Group similar content together with headings and subheadings
- Content Annotation: Add brief descriptions to links to provide additional context
- Depth Control: Determine how deep AI models should crawl into your site hierarchy
- Expanded Metadata: Include additional metadata about page content, update frequency, or content authority
Example of a customized llms.txt section:
```markdown
## Product Categories (High Priority)
- [Sustainable Products](/sustainable-products) - Our eco-friendly product line with carbon-neutral manufacturing
- [Bestsellers](/bestsellers) - Most popular products based on customer reviews and sales

## Knowledge Base (Medium Priority)
- [Tutorials](/tutorials) - Step-by-step guides for product usage
- [FAQs](/faqs) - Frequently asked questions about our products and services
```

The llms.txt file is an emerging standard for providing structured guidance to large language models about your website content. You can learn more about the standard at https://llmstxt.org/.
- Generate the file: Use this tool to extract your site's navigation and generate the base `llms.txt` file
- Customize the output: Edit the file to prioritize important content and add context
- Create an expanded version: Consider creating an `llms-full.txt` file with more comprehensive site mapping
- Place the file on your server: Upload the file to your website's root directory (e.g., `https://example.com/llms.txt`)
Add a reference to your llms.txt file in your robots.txt file:
```
# Standard robots directives
User-agent: *
Allow: /

# LLM guidance
LLM: https://example.com/llms.txt
LLM-full: https://example.com/llms-full.txt
```
For more detailed content guidance, you can reference individual Markdown files in your llms.txt or llms-full.txt:
```markdown
# Company Overview
- [About Us](/about) - Learn about our company history and mission
- Detailed content: [about-us-detailed.md](/content/about-us-detailed.md)

# Product Information
- [Product Catalog](/products) - Browse our complete product lineup
- Technical specifications: [product-specs.md](/content/product-specs.md)
```

For optimal results, combine your llms.txt implementation with structured data markup (JSON-LD, Schema.org) on your web pages to provide consistent signals to both traditional search engines and AI systems.
This tool accesses web content and extracts data from websites. When using this tool, please ensure you respect each website's terms of service and robots.txt directives:
- Check robots.txt first: Before scraping any website, check its `/robots.txt` file (e.g., `https://example.com/robots.txt`) to see if the site owner has specified any crawling restrictions.
- Honor disallow directives: Do not scrape pages or sections that are explicitly disallowed in the robots.txt file.
- Implement rate limiting: To avoid overwhelming servers, consider implementing additional rate limiting when extracting data from multiple pages of the same domain.
- Identify your scraper: Consider modifying the user agent to identify your scraper and provide contact information.
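For the rate-limiting and user-agent points above, one hedged way to do it (the delay value and bot name are arbitrary placeholders):

```python
import time
import requests

# Arbitrary illustrative values -- tune per target site
USER_AGENT = "MyScraperBot/1.0 (contact: you@example.com)"
DELAY_SECONDS = 2.0


def polite_get(url: str) -> requests.Response:
    """Fetch a URL with an identifying user agent and a fixed delay between requests."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    time.sleep(DELAY_SECONDS)  # crude rate limit: pause after every request
    return response
```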
Example of adding robots.txt compliance to the code:
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(url, user_agent="MyScraperBot"):
    """Return True if robots.txt permits fetching the URL with this user agent."""
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    rp = RobotFileParser()
    try:
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt doesn't exist or can't be parsed, proceed with caution
        return True


# Use this function before scraping any URL
url_to_scrape = "https://example.com/"  # placeholder
if is_allowed(url_to_scrape):
    pass  # Proceed with scraping
else:
    pass  # Skip this URL or notify the user
```

- This tool should only be used to extract publicly available information
- Do not use this tool to bypass paywalls, login requirements, or access restricted content
- Consider the load your scraping puts on the target websites, especially smaller sites
- When in doubt, contact the website owner for permission
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
