
LLMS Generator Toolkit 🤖

⚡ Convert Websites into High-Quality Training Data for LLMs & AI-First SEO. Build optimized llms.txt files and clean Markdown datasets from real website content, perfect for LLM pipelines, AI-ready sitemaps, and content-aware SEO.

LLMS Robo Logo

Main Features

  1. Navigation Extraction: Scrape navigation elements from websites using customizable CSS selectors, then generate an editable llms.txt preview that you can refine before downloading
  2. Link Conversion: Transform links in various formats (Markdown, HTML, raw URLs) into a standardized format
  3. URL to Markdown: Process URLs to extract key content and convert it to Markdown format

Why Use This Toolkit?

  • 🧠 Create better training datasets for custom LLMs and retrieval-augmented generation (RAG) systems
  • 🔍 Make your content more accessible to generative search engines and AI assistants
  • 🛠️ Quickly generate structured Markdown from live websites — no manual cleanup
  • 📁 Design AI-first sitemaps (llms.txt) to guide crawlers and reduce hallucination

Who This Is For

  • SEO professionals building llms.txt files for AI search engines
  • AI engineers preparing website-derived training datasets
  • Product teams designing AI-friendly site structures
  • Anyone exploring the intersection of SEO and LLMs

Live Demo

LLMS Generator Screenshot

Try it here → llms-generator-toolkit.onrender.com

Suggested Use Cases to Try:

  • Generate a semantic site map from any homepage navigation
  • Convert links into AI-readable Markdown
  • Extract clean Markdown from product/category/landing pages

Note: The demo is hosted on Render's free tier plan, so the navigation extraction feature may be slow to respond. This is due to free tier limitations rather than issues with the application itself. For faster performance, consider running the application locally.

🤓 Tech Highlights

  • Built with Dash for reactive Python UI
  • Uses Playwright to scrape dynamic websites
  • HTML parsing via BeautifulSoup + semantic Markdown generation
  • Runs on Render with Chromium sandboxing enabled

🔧 Installation

📋 Prerequisites

  • Python 3.8+ (Python 3.11+ recommended)
  • pip package manager
  • Playwright (installed automatically)

📦 Local Setup

1. Clone this repository:

git clone https://github.com/jeredhiggins/llms-generator-toolkit.git
cd llms-generator-toolkit

2. Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install Python dependencies:

pip install --upgrade pip
pip install -r requirements.txt

4. Install the Playwright browser (Chromium only):

python -m playwright install chromium

🚀 Usage

Start the application:

python app.py

Then open your browser at:

http://127.0.0.1:8050

🌐 Deployment on Render

This project is configured to deploy easily on Render.

Key Notes:

  • Chromium is installed manually during the build without --with-deps
  • The browser cache is stored at /opt/render/.cache/ms-playwright
  • No system-level dependencies are required due to a clean headless setup

The included render.yaml handles all of this automatically:

buildCommand: |
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  mkdir -p /opt/render/.cache/ms-playwright
  python -m playwright install --force chromium

The app is started using gunicorn:

startCommand: .venv/bin/gunicorn --bind 0.0.0.0:$PORT --timeout 600 app:server

Environment variables:

envVars:
  - key: PYTHONUNBUFFERED
    value: true
  - key: PLAYWRIGHT_BROWSERS_PATH
    value: /opt/render/.cache/ms-playwright

Navigation Extraction

Extract the navigation structure from a website:

  1. Enter the homepage URL (e.g., https://www.lego.com/en-ie)
  2. Specify the root navigation selector (e.g., nav[data-test="main-navigation"])
  3. Optionally, provide a context selector to refine link extraction
  4. Click "Extract Navigation"
  5. View the Markdown-formatted navigation structure
  6. Further customize/edit the generated Markdown
  7. Download the result as llms.txt
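For example, extracting the main navigation of a typical storefront might produce a preview along these lines (the headings and links below are illustrative placeholders, not real output):

# Example Store

## Shop
- [New Arrivals](https://example.com/new)
- [Bestsellers](https://example.com/bestsellers)

## Support
- [Help Center](https://example.com/help)
- [Contact Us](https://example.com/contact)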

Link Conversion

Convert links from various formats to a standardized format:

  1. Paste Markdown links, HTML links, or raw URLs in the input area
  2. Click "Convert Links"
  3. The converted links will appear in standardized Markdown format
  4. Copy and use the converted links as needed
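For instance, mixed input such as the following (placeholder URLs):

<a href="https://example.com/about">About Us</a>
https://example.com/pricing
[Blog](https://example.com/blog)

comes back as a consistent set of Markdown links, roughly:

- [About Us](https://example.com/about)
- [https://example.com/pricing](https://example.com/pricing)
- [Blog](https://example.com/blog)

The exact output formatting may differ; the point is that all three input styles converge on standard Markdown link syntax.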

URL to Markdown

Process URLs to extract key content in Markdown format:

  1. Enter one or more URLs (one per line) in the input area
  2. Click "Convert URLs"
  3. View the extracted Markdown content for each URL
  4. Download the results as individual Markdown files or a ZIP archive
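Under the hood, the toolkit uses Playwright for fetching and BeautifulSoup for parsing (see Tech Highlights above). As a rough sketch of the same idea, assuming a simple headings-and-paragraphs extraction rather than the toolkit's actual content pipeline:

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def url_to_markdown(url: str) -> str:
    """Render a page with headless Chromium and emit simple Markdown (illustrative sketch)."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all(["h1", "h2", "h3", "p"]):
        text = tag.get_text(strip=True)
        if not text:
            continue
        if tag.name == "h1":
            lines.append(f"# {text}")
        elif tag.name == "h2":
            lines.append(f"## {text}")
        elif tag.name == "h3":
            lines.append(f"### {text}")
        else:
            lines.append(text)
    return "\n\n".join(lines)

print(url_to_markdown("https://example.com"))

The real app produces richer Markdown, but the flow is the same: render the page with Chromium, parse the DOM, emit Markdown.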

Search Optimization Benefits

LLMs.txt for AI Crawlers and Search

The llms.txt file generated by this tool serves as a guide for AI-based crawlers and large language models, similar to how robots.txt guides traditional web crawlers. Implementing an llms.txt file offers several benefits:

  1. Enhanced AI Discoverability: Help AI systems better understand your site structure and important content
  2. Improved Content Representation: Ensure AI models accurately represent your website's content in search results and responses
  3. Training Data Optimization: Guide LLMs on which content to prioritize for training and responses
  4. Reduced Hallucination Risk: Provide structured, accurate data to reduce the risk of AI models generating incorrect information about your site
  5. Better Resource Allocation: Indicate which parts of your site are most relevant for AI training and responses

Customizing Your llms.txt Output

The tool allows you to customize the llms.txt output to optimize for specific AI models and search algorithms:

  1. Content Prioritization: Order your navigation links by importance to guide AI crawlers to your most valuable content first
  2. Content Categorization: Group similar content together with headings and subheadings
  3. Content Annotation: Add brief descriptions to links to provide additional context
  4. Depth Control: Determine how deep AI models should crawl into your site hierarchy
  5. Expanded Metadata: Include additional metadata about page content, update frequency, or content authority

Example of a customized llms.txt section:

## Product Categories (High Priority)
- [Sustainable Products](/sustainable-products) - Our eco-friendly product line with carbon-neutral manufacturing
- [Bestsellers](/bestsellers) - Most popular products based on customer reviews and sales

## Knowledge Base (Medium Priority)
- [Tutorials](/tutorials) - Step-by-step guides for product usage
- [FAQs](/faqs) - Frequently asked questions about our products and services

Implementing llms.txt on Your Website

What is llms.txt?

The llms.txt file is an emerging standard for providing structured guidance to large language models about your website content. You can learn more about the standard at https://llmstxt.org/.

Setting Up llms.txt

  1. Generate the file: Use this tool to extract your site's navigation and generate the base llms.txt file
  2. Customize the output: Edit the file to prioritize important content and add context
  3. Create an expanded version: Consider creating an llms-full.txt file with more comprehensive site mapping
  4. Place the file on your server: Upload the file to your website's root directory (e.g., https://example.com/llms.txt)

Reference in robots.txt

Add a reference to your llms.txt file in your robots.txt file (note that these LLM lines are not part of the robots.txt standard; crawlers that don't understand them will simply ignore unrecognized fields):

# Standard robots directives
User-agent: *
Allow: /

# LLM guidance
LLM: https://example.com/llms.txt
LLM-full: https://example.com/llms-full.txt

Reference Markdown Files

For more detailed content guidance, you can reference individual Markdown files in your llms.txt or llms-full.txt:

# Company Overview
- [About Us](/about) - Learn about our company history and mission
  - Detailed content: [about-us-detailed.md](/content/about-us-detailed.md)
  
# Product Information
- [Product Catalog](/products) - Browse our complete product lineup
  - Technical specifications: [product-specs.md](/content/product-specs.md)

Structured Data Integration

For optimal results, combine your llms.txt implementation with structured data markup (JSON-LD, Schema.org) on your web pages to provide consistent signals to both traditional search engines and AI systems.
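As an illustration, a minimal JSON-LD block like the following (all values are placeholders) could describe the same site that your llms.txt maps:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Example Company",
  "url": "https://example.com",
  "description": "Eco-friendly products with carbon-neutral manufacturing"
}
</script>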

Responsible Web Scraping

Respecting robots.txt

This tool accesses web content and extracts data from websites. When using this tool, please ensure you respect each website's terms of service and robots.txt directives:

  1. Check robots.txt first: Before scraping any website, check its /robots.txt file (e.g., https://example.com/robots.txt) to see if the site owner has specified any crawling restrictions.

  2. Honor disallow directives: Do not scrape pages or sections that are explicitly disallowed in the robots.txt file.

  3. Implement rate limiting: To avoid overwhelming servers, consider implementing additional rate limiting when extracting data from multiple pages of the same domain (a minimal sketch follows the robots.txt example below).

  4. Identify your scraper: Consider modifying the user agent to identify your scraper and provide contact information.

Example of adding robots.txt compliance to the code:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="MyScraperBot"):
    """Check the target site's robots.txt before fetching a URL."""
    rp = RobotFileParser()
    parsed_url = urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"

    try:
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt doesn't exist or can't be parsed, proceed with caution
        return True

# Use this function before scraping any URL
url_to_scrape = "https://example.com/page"
if is_allowed(url_to_scrape):
    print(f"Scraping {url_to_scrape}")   # proceed with scraping
else:
    print(f"Skipping {url_to_scrape}")   # disallowed by robots.txt
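
For the rate-limiting point above, a simple per-domain delay could look like this (the 2-second interval and function name are arbitrary examples, not part of the toolkit):

import time
from urllib.parse import urlparse

_last_request_time = {}

def wait_for_domain(url, min_interval=2.0):
    """Sleep so that requests to the same domain are at least min_interval seconds apart."""
    domain = urlparse(url).netloc
    elapsed = time.time() - _last_request_time.get(domain, 0.0)
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_request_time[domain] = time.time()

# Call wait_for_domain(url) immediately before each request to a site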

Legal and Ethical Considerations

  • This tool should only be used to extract publicly available information
  • Do not use this tool to bypass paywalls, login requirements, or access restricted content
  • Consider the load your scraping puts on the target websites, especially smaller sites
  • When in doubt, contact the website owner for permission

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
