A CLI application that generates an 'internal links' report for any website by crawling each page of the site!
go build -o crawler
./crawler <URL> <maxConcurrency> <maxPages>
Example:
./crawler "https://example.com" 5 100
Arguments:
- URL: The website to crawl (starting point)
- maxConcurrency: Maximum number of concurrent HTTP requests (e.g., 5)
- maxPages: Maximum number of pages to crawl (e.g., 100)
The crawler generates a report.csv file with the following columns:
- page_url: The normalized URL of the page
- h1: The main heading (H1 tag) of the page
- first_paragraph: The first paragraph of content (content inside the <main> tag is prioritized)
- outgoing_link_urls: All links found on the page (semicolon-separated)
- image_urls: All image URLs found on the page (semicolon-separated)
This CSV can be opened in Excel, Google Sheets, or any spreadsheet program for analysis.
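For illustration, here is a minimal sketch of how such a report could be written with Go's standard encoding/csv package. The PageData field names below are assumptions for the example, not necessarily the crawler's exact struct.

```go
package main

import (
	"encoding/csv"
	"os"
	"strings"
)

// PageData mirrors the columns of report.csv; the field names here
// are illustrative assumptions, not the crawler's exact struct.
type PageData struct {
	URL            string
	H1             string
	FirstParagraph string
	OutgoingLinks  []string
	ImageURLs      []string
}

func writeCSVReport(pages []PageData, filename string) error {
	f, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	// Header row matches the columns documented above.
	if err := w.Write([]string{"page_url", "h1", "first_paragraph", "outgoing_link_urls", "image_urls"}); err != nil {
		return err
	}
	for _, p := range pages {
		record := []string{
			p.URL,
			p.H1,
			p.FirstParagraph,
			strings.Join(p.OutgoingLinks, ";"), // semicolon-separated, as documented
			strings.Join(p.ImageURLs, ";"),
		}
		if err := w.Write(record); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	sample := []PageData{{
		URL:            "example.com",
		H1:             "Welcome",
		FirstParagraph: "Hello world.",
		OutgoingLinks:  []string{"example.com/about", "example.com/contact"},
		ImageURLs:      []string{"example.com/logo.png"},
	}}
	if err := writeCSVReport(sample, "report.csv"); err != nil {
		panic(err)
	}
}
```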
A web crawler, commonly referred to as a bot, is used by search engine companies to index new websites and keep their search results up to date! AI companies such as OpenAI, Meta, Google (Gemini), and Anthropic also run their own crawlers, both to power web search features and to scrape data for training models.
This is my motivation for creating Phoenix, which can not only scrape data but also generate link reports (showing which pages reference other pages on the site) and write a Markdown file for each crawled page for LLMs to consume.
- URL normalization (remove scheme and trailing slashes; see the first sketch after this list)
- Extract H1 tags from HTML (see the parsing sketch after this list)
- Extract first paragraph from HTML (with main tag priority)
- Extract all links from HTML pages (convert relative to absolute URLs; see the ResolveReference sketch below)
- Extract all image URLs from HTML (convert relative to absolute URLs)
- Structured page data extraction (PageData struct)
- CLI argument validation and handling
- HTTP fetching with User-Agent header and content-type validation (see the fetching sketch below)
- Recursive web crawling with same-domain restriction
- Link reference counting across entire website
- Concurrent crawling with goroutines and WaitGroups (see the concurrency sketch after this list)
- Thread-safe page tracking with mutexes
- Configurable concurrency control via buffered channels
- Configurable max pages limit to prevent excessive crawling
- CSV report export with full page data (URL, H1, paragraph, links, images)
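A minimal sketch of the URL normalization step, assuming net/url parsing and a simple host+path key; the exact rules the crawler applies may differ:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL strips the scheme and any trailing slash so that
// https://example.com/path/ and http://example.com/path map to the same key.
func normalizeURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	normalized := u.Host + u.Path
	return strings.TrimSuffix(normalized, "/"), nil
}

func main() {
	for _, raw := range []string{"https://example.com/path/", "http://example.com/path"} {
		n, _ := normalizeURL(raw)
		fmt.Println(n) // example.com/path for both
	}
}
```

Treating host+path as the map key is what lets different spellings of the same link collapse onto a single page in the reference counts.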
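H1 and paragraph extraction can be done by walking the parse tree from the golang.org/x/net/html package (fetch it with `go get golang.org/x/net/html`). This sketch shows only the H1 case:

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// getH1FromHTML returns the text of the first <h1> in the document,
// or an empty string if the page has none.
func getH1FromHTML(body string) string {
	doc, err := html.Parse(strings.NewReader(body))
	if err != nil {
		return ""
	}
	var h1 string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if h1 != "" {
			return // stop once the first <h1> is found
		}
		if n.Type == html.ElementNode && n.Data == "h1" {
			h1 = nodeText(n)
			return
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return strings.TrimSpace(h1)
}

// nodeText concatenates all text nodes beneath n.
func nodeText(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		sb.WriteString(nodeText(c))
	}
	return sb.String()
}

func main() {
	fmt.Println(getH1FromHTML(`<html><body><h1>Welcome</h1></body></html>`)) // Welcome
}
```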
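Converting relative links and image sources to absolute URLs is a natural fit for net/url's ResolveReference; the base and href values below are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The base URL is the page the link was found on (hypothetical value).
	base, err := url.Parse("https://example.com/blog/post/")
	if err != nil {
		panic(err)
	}
	for _, href := range []string{"/about", "../archive", "https://other.site/page"} {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip malformed hrefs
		}
		// ResolveReference turns a relative href into an absolute URL
		// against the base; absolute hrefs pass through unchanged.
		fmt.Println(base.ResolveReference(ref))
	}
}
```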
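A sketch of the fetching step with a User-Agent header and content-type validation; the header value and error messages are assumptions for the example:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// getHTML fetches a page, sending a User-Agent header and rejecting
// error statuses and non-HTML responses.
func getHTML(rawURL string) (string, error) {
	req, err := http.NewRequest("GET", rawURL, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", "phoenix-crawler/1.0") // assumed value

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		return "", fmt.Errorf("got HTTP error: %s", resp.Status)
	}
	if ct := resp.Header.Get("Content-Type"); !strings.Contains(ct, "text/html") {
		return "", fmt.Errorf("got non-HTML response: %s", ct)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := getHTML("https://example.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(html), "bytes fetched")
}
```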
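Finally, a sketch of the concurrency pattern the last few items describe: a buffered channel as a semaphore for maxConcurrency, a WaitGroup to wait for all goroutines, and a mutex guarding the shared pages map and the maxPages check. Field and method names are illustrative, not the crawler's exact ones:

```go
package main

import "sync"

// config holds the shared crawl state.
type config struct {
	pages              map[string]int // normalized URL -> reference count
	maxPages           int
	mu                 *sync.Mutex
	concurrencyControl chan struct{} // buffered channel acting as a semaphore
	wg                 *sync.WaitGroup
}

// addPageVisit bumps the reference count for a page and reports
// whether this is the first time the page has been seen.
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()
	if _, visited := cfg.pages[normalizedURL]; visited {
		cfg.pages[normalizedURL]++
		return false
	}
	cfg.pages[normalizedURL] = 1
	return true
}

// pagesLen reads the map size under the lock, so the maxPages check
// is safe to call from any goroutine.
func (cfg *config) pagesLen() int {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()
	return len(cfg.pages)
}

// crawlPage sketches the per-page goroutine: acquire a semaphore slot,
// stop once maxPages is reached, then fetch, parse, and spawn children.
func (cfg *config) crawlPage(rawURL string) {
	cfg.concurrencyControl <- struct{}{} // acquire a slot (blocks at the limit)
	defer func() {
		<-cfg.concurrencyControl // release the slot
		cfg.wg.Done()
	}()

	if cfg.pagesLen() >= cfg.maxPages {
		return
	}
	// ... normalize rawURL, record the visit with addPageVisit,
	// fetch the HTML, extract links, then for each same-domain link:
	// cfg.wg.Add(1); go cfg.crawlPage(link)
}

func main() {
	cfg := &config{
		pages:              map[string]int{},
		maxPages:           100,
		mu:                 &sync.Mutex{},
		concurrencyControl: make(chan struct{}, 5), // maxConcurrency = 5
		wg:                 &sync.WaitGroup{},
	}
	cfg.wg.Add(1)
	go cfg.crawlPage("https://example.com")
	cfg.wg.Wait()
}
```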