 ██████╗ ██╗   ██╗██████╗ ██╗
██╔════╝ ██║   ██║██╔══██╗██║
██║  ███╗██║   ██║██████╔╝██║
██║   ██║██║   ██║██╔══██╗██║
╚██████╔╝╚██████╔╝██║  ██║███████╗
 ╚═════╝  ╚═════╝ ╚═╝  ╚═╝╚══════╝

GURL - Go URL Email Crawler

An intelligent web crawler built in Go that extracts email addresses from websites with precision and speed.

🚀 Fast, intelligent, and scalable email discovery for modern web applications

✨ Features

  • 🧠 Intelligent Crawling: Prioritizes contact and information pages
  • 🌍 Multi-language Support: Recognizes keywords in 6 languages (Spanish, English, French, German, Italian, Portuguese)
  • 🔄 Meta Redirects: Automatically follows HTML meta redirects
  • ⚡ Redis Cache: Smart caching with 12-month persistence and up to 5,400x faster responses on cache hits
  • 🚀 Async Processing: Background jobs with webhook notifications
  • 🔁 Auto Deduplication: Automatically removes duplicate emails (see the sketch after this list)
  • 🐳 Dockerized: Easy deployment with Docker Compose
  • 📡 REST API: Both synchronous and asynchronous endpoints
  • ⚙️ Configurable Depth: Explores up to 3 levels deep (configurable)
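
For a concrete picture of the extraction and deduplication steps, here is a minimal Go sketch. It is illustrative only: the regex, the extractEmails helper, and the lowercase normalization are assumptions for demonstration, not the project's actual implementation in internal/crawler.

// Illustrative sketch: regex-based email extraction with deduplication.
// The regex and helper name are assumptions, not the project's real code.
package main

import (
    "fmt"
    "regexp"
    "strings"
)

var emailRe = regexp.MustCompile(`[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}`)

// extractEmails finds all email-like strings in an HTML page and
// returns them lowercased and deduplicated, preserving first-seen order.
func extractEmails(html string) []string {
    seen := make(map[string]bool)
    var emails []string
    for _, m := range emailRe.FindAllString(html, -1) {
        e := strings.ToLower(m)
        if !seen[e] {
            seen[e] = true
            emails = append(emails, e)
        }
    }
    return emails
}

func main() {
    page := `<a href="mailto:Contact@Example.com">contact@example.com</a>`
    fmt.Println(extractEmails(page)) // [contact@example.com]
}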

☕ Support

If you like this project, consider buying me a coffee ☕💛 Buy Me A Coffee

📋 Requirements

  • Docker
  • Docker Compose

🚀 Quick Start

Option 1: Use Pre-built Docker Image (Recommended)

# Pull and run the latest image
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -p 6379:6379 \
  luisra51/gurl:latest

# Or use with external Redis
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -e REDIS_HOST=your-redis-host \
  -e REDIS_PORT=6379 \
  luisra51/gurl:latest

Option 2: Clone and Build from Source

git clone https://github.com/luisra51/gurl.git
cd gurl
docker-compose up --build

Use the API

The service will be available at http://localhost:8080

Synchronous Scanning (Immediate Response)

# Basic scan
curl "http://localhost:8080/scan?url=example.com"

# With specific protocol
curl "http://localhost:8080/scan?url=https://company.com"

Response:

{
  "emails": ["[email protected]", "[email protected]"],
  "from_cache": false,
  "crawl_time": "2.3s"
}
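
If you are calling the synchronous endpoint from Go rather than curl, a minimal client might look like the sketch below. The ScanResult struct mirrors the response fields shown above; everything else (names, error handling) is illustrative.

// Minimal Go client for the synchronous /scan endpoint, assuming the
// response shape documented above.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

type ScanResult struct {
    Emails    []string `json:"emails"`
    FromCache bool     `json:"from_cache"`
    CrawlTime string   `json:"crawl_time"`
}

func main() {
    endpoint := "http://localhost:8080/scan?url=" + url.QueryEscape("example.com")
    resp, err := http.Get(endpoint)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var result ScanResult
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("found %d emails (cached: %v, took %s)\n",
        len(result.Emails), result.FromCache, result.CrawlTime)
}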

Asynchronous Scanning (For Slow URLs)

curl -X POST "http://localhost:8080/scan/async" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "slow-website.com",
    "webhook_url": "https://your-api.com/webhook",
    "callback_id": "optional-tracking-id"
  }'

Immediate Response:

{
  "job_id": "uuid-123-456-789",
  "status": "queued",
  "estimated_time": "30-60s",
  "webhook_url": "https://your-api.com/webhook",
  "check_status_url": "/scan/status/uuid-123-456-789"
}

Webhook Callback (When Complete):

{
  "job_id": "uuid-123-456-789",
  "callback_id": "optional-tracking-id",
  "status": "completed",
  "url": "https://slow-website.com",
  "emails": ["[email protected]"],
  "crawl_time": "45.2s",
  "pages_visited": 15,
  "completed_at": "2025-08-07T10:30:00Z"
}
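
To receive this callback, your service needs an HTTP endpoint that accepts the JSON above. Here is a minimal Go sketch of such a receiver; the struct fields follow the documented payload, while the port and path are placeholder assumptions.

// Sketch of a webhook receiver for the async callback payload above.
package main

import (
    "encoding/json"
    "log"
    "net/http"
)

type CrawlCallback struct {
    JobID        string   `json:"job_id"`
    CallbackID   string   `json:"callback_id"`
    Status       string   `json:"status"`
    URL          string   `json:"url"`
    Emails       []string `json:"emails"`
    CrawlTime    string   `json:"crawl_time"`
    PagesVisited int      `json:"pages_visited"`
    CompletedAt  string   `json:"completed_at"`
}

func main() {
    http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
        var cb CrawlCallback
        if err := json.NewDecoder(r.Body).Decode(&cb); err != nil {
            http.Error(w, "bad payload", http.StatusBadRequest)
            return
        }
        log.Printf("job %s finished: %d emails from %s", cb.JobID, len(cb.Emails), cb.URL)
        w.WriteHeader(http.StatusOK) // acknowledge receipt
    })
    log.Fatal(http.ListenAndServe(":9090", nil))
}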

Response Types

Success with Emails Found:

{
  "emails": ["[email protected]", "[email protected]"],
  "from_cache": true,
  "crawl_time": "396ยตs"
}

Success without Emails:

{
  "emails": [],
  "from_cache": false,
  "crawl_time": "2.1s"
}

Error:

{
  "error": "Invalid URL provided"
}

๐ŸŒ Multi-language Support

The crawler intelligently recognizes contact-related keywords in 6 languages:

  • 🇪🇸 Spanish: contacto, información, equipo, nosotros, empresa
  • 🇺🇸 English: contact, about, team, support, help, office
  • 🇫🇷 French: nous-contacter, équipe, aide, assistance, bureau
  • 🇩🇪 German: kontakt, über-uns, impressum, unser-team, hilfe
  • 🇮🇹 Italian: contatti, chi-siamo, squadra, informazioni, supporto
  • 🇵🇹 Portuguese: contato, sobre-nos, equipe, ajuda, suporte

43+ keywords total across all languages for maximum coverage
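
One simple way to implement this kind of prioritization is to score links by whether their URL contains a contact-related keyword and visit the promising ones first. The sketch below shows the idea with a small excerpt of the keyword list; the crawler's real scoring logic may differ.

// Illustrative sketch of keyword-based link prioritization: links whose
// URL contains a contact-related keyword are crawled first.
package main

import (
    "fmt"
    "sort"
    "strings"
)

var contactKeywords = []string{
    "contact", "contacto", "nous-contacter", "kontakt", "contatti", "contato",
    "about", "nosotros", "impressum", "chi-siamo", "sobre-nos", "team",
}

// priority returns 0 for links that look like contact pages, 1 otherwise,
// so sorting by priority visits the promising pages first.
func priority(link string) int {
    l := strings.ToLower(link)
    for _, kw := range contactKeywords {
        if strings.Contains(l, kw) {
            return 0
        }
    }
    return 1
}

func main() {
    links := []string{"/products", "/kontakt", "/blog", "/about-us"}
    sort.SliceStable(links, func(i, j int) bool {
        return priority(links[i]) < priority(links[j])
    })
    fmt.Println(links) // [/kontakt /about-us /products /blog]
}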

🔌 API Endpoints

Synchronous Endpoints

Method   Endpoint                           Description
GET      /scan?url=<website>                Scan website (immediate response)
GET      /cache/stats                       View Redis cache statistics
DELETE   /cache/invalidate                  Clear all cache
DELETE   /cache/invalidate?url=<website>    Clear specific URL cache

Asynchronous Endpoints

Method   Endpoint                 Description
POST     /scan/async              Create async scan job
GET      /scan/status/<job_id>    Check job status
DELETE   /scan/cancel/<job_id>    Cancel queued job
GET      /scan/jobs               View active job statistics

Advanced Usage Examples

# View cache statistics
curl "http://localhost:8080/cache/stats"

# Check async job status
curl "http://localhost:8080/scan/status/uuid-123-456"

# Cancel queued job
curl -X DELETE "http://localhost:8080/scan/cancel/uuid-123-456"

# View active jobs and statistics
curl "http://localhost:8080/scan/jobs"

# Clear complete cache
curl -X DELETE "http://localhost:8080/cache/invalidate"

⚙️ Configuration

Environment Variables

# Crawler Settings
CRAWLER_MAX_DEPTH=3                    # Maximum crawling depth
CRAWLER_DEDUPLICATE_EMAILS=true       # Remove duplicate emails

# Cache Settings  
CACHE_ENABLED=true                     # Enable Redis cache
CACHE_EXPIRATION_MONTHS=12             # Cache TTL in months

# Async Processing Settings
ASYNC_ENABLED=true                     # Enable async processing
ASYNC_WORKERS=3                        # Number of parallel workers
ASYNC_JOB_TIMEOUT_SECONDS=300          # Job timeout (5 minutes)
ASYNC_WEBHOOK_RETRIES=3                # Webhook retry attempts

# Redis Configuration
REDIS_HOST=localhost                   # Redis host
REDIS_PORT=6379                        # Redis port
REDIS_PERSIST_DISK=false              # Disk persistence (prod: true)

# Server Configuration
SERVER_PORT=8080                       # Server port
SERVER_HOST=0.0.0.0                   # Server host
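
A minimal sketch of how such environment-based configuration can be loaded in Go with only the standard library is shown below; the Config struct and envOr helper are illustrative, not the project's actual internal/config package.

// Sketch of environment-based configuration loading for the variables above.
package main

import (
    "fmt"
    "os"
    "strconv"
)

type Config struct {
    MaxDepth         int
    CacheEnabled     bool
    ExpirationMonths int
    RedisHost        string
    RedisPort        string
}

// envOr returns the environment variable's value, or a default when unset.
func envOr(key, def string) string {
    if v := os.Getenv(key); v != "" {
        return v
    }
    return def
}

func Load() Config {
    depth, _ := strconv.Atoi(envOr("CRAWLER_MAX_DEPTH", "3"))
    months, _ := strconv.Atoi(envOr("CACHE_EXPIRATION_MONTHS", "12"))
    return Config{
        MaxDepth:         depth,
        CacheEnabled:     envOr("CACHE_ENABLED", "true") == "true",
        ExpirationMonths: months,
        RedisHost:        envOr("REDIS_HOST", "localhost"),
        RedisPort:        envOr("REDIS_PORT", "6379"),
    }
}

func main() {
    fmt.Printf("%+v\n", Load())
}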

How It Works

  • 🎯 Smart Crawling: Prioritizes contact pages with multilingual keywords
  • 📊 Depth Control: Configurable depth (default: 3 levels)
  • ⚡ Cache System: Redis-based caching with 12-month TTL (see the sketch after this list)
  • 🔄 Auto Deduplication: Automatic email normalization and deduplication
  • 🚀 Performance: Up to 5,400x faster responses on cache hits
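
The caching pattern boils down to: look the URL up in Redis, return the stored result on a hit, otherwise crawl and store the result with a long TTL. A minimal sketch using the go-redis client follows; the key naming, JSON encoding, and choice of client library are assumptions for illustration.

// Sketch of the cache-aside pattern: check Redis before crawling and
// store results with a months-long TTL.
package main

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    key := "scan:example.com"
    ttl := 12 * 30 * 24 * time.Hour // ~12 months, mirroring CACHE_EXPIRATION_MONTHS

    // Cache hit: return the stored result without crawling.
    if cached, err := rdb.Get(ctx, key).Result(); err == nil {
        fmt.Println("from cache:", cached)
        return
    }

    // Cache miss: crawl (stubbed here), then store the result.
    emails := []string{"contact@example.com"} // stand-in for a real crawl
    payload, _ := json.Marshal(emails)
    if err := rdb.Set(ctx, key, payload, ttl).Err(); err != nil {
        fmt.Println("cache write failed:", err)
    }
    fmt.Println("crawled:", emails)
}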

๐Ÿ—๏ธ Project Architecture

/
├── .env                     # Environment variables (development)
├── .env.example             # Configuration example
├── go.mod                   # Go dependencies
├── Dockerfile               # Container definition
├── docker-compose.yml       # Redis + App services
├── scan_urls.sh             # Batch processing script
├── cmd/
│   └── crawler/
│       └── main.go          # Application entry point
└── internal/
    ├── cache/
    │   └── cache.go         # Redis cache management
    ├── config/
    │   └── config.go        # Environment configuration
    ├── crawler/
    │   └── crawler.go       # Core crawling logic
    ├── handler/
    │   └── handler.go       # HTTP endpoints (sync + async)
    └── jobs/
        ├── types.go         # Job data types
        ├── queue.go         # Redis job queue
        └── worker.go        # Worker system + webhooks

Core Components

  • 🗄️ Cache Layer: Redis with configurable TTL and optional persistence
  • ⚙️ Job Queue: Redis-based async system with parallel workers
  • 📡 Webhook System: Result delivery with retries and exponential backoff (see the sketch after this list)
  • 🌍 Multi-language: 43+ keywords across 6 languages
  • 🔧 Config Management: Environment-based configuration
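
As a sketch of the webhook delivery pattern, the following Go function retries a POST with exponentially growing delays. The retry count mirrors ASYNC_WEBHOOK_RETRIES; the base delay and function name are assumptions, not the code in internal/jobs/worker.go.

// Sketch of webhook delivery with retries and exponential backoff.
package main

import (
    "bytes"
    "fmt"
    "net/http"
    "time"
)

// deliverWebhook POSTs the payload, retrying with exponentially growing
// delays (1s, 2s, 4s, ...) until it succeeds or retries are exhausted.
func deliverWebhook(url string, payload []byte, retries int) error {
    delay := time.Second
    for attempt := 0; attempt <= retries; attempt++ {
        resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode < 300 {
                return nil
            }
        }
        if attempt < retries {
            time.Sleep(delay)
            delay *= 2 // exponential backoff
        }
    }
    return fmt.Errorf("webhook delivery to %s failed after %d retries", url, retries)
}

func main() {
    err := deliverWebhook("https://your-api.com/webhook", []byte(`{"status":"completed"}`), 3)
    fmt.Println(err)
}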

🔧 Development

With Docker (Recommended)

# Copy environment variables
cp .env.example .env

# Start complete stack
docker-compose up --build

Without Docker

# Install Redis locally
# Ubuntu/Debian: sudo apt install redis-server
# macOS: brew install redis

# Start Redis
redis-server

# Install Go dependencies
go mod tidy

# Run application
go run cmd/crawler/main.go

🤝 Contributing

We welcome contributions! Here's how you can help:

Ways to Contribute

  • ๐Ÿ› Bug Reports: Found a bug? Open an issue
  • โœจ Feature Requests: Have an idea? Start a discussion
  • ๐Ÿ“ Documentation: Improve docs, add examples, fix typos
  • ๐ŸŒ Translations: Add support for more languages
  • ๐Ÿงช Testing: Write tests, test edge cases
  • ๐Ÿ’ป Code: Implement new features or fix bugs

Development Setup

  1. Fork the repository
  2. Clone your fork:
    git clone https://github.com/your-username/gurl.git
    cd gurl
  3. Create a feature branch:
    git checkout -b feature/amazing-feature
  4. Make your changes
  5. Test your changes:
    docker-compose up --build
    # Test your changes
  6. Commit and push:
    git commit -m "Add amazing feature"
    git push origin feature/amazing-feature
  7. Open a Pull Request

Code Style

  • Follow standard Go conventions (go fmt, go vet)
  • Add tests for new features
  • Update documentation for API changes
  • Use meaningful commit messages

📝 Limitations

  • JavaScript: Does not execute JavaScript, only analyzes static HTML
  • Single Page Applications: Limited on SPAs that load content dynamically
  • Rate limiting: Does not implement throttling between requests
  • Same domain: Only crawls pages from the same base domain

🚀 Use Cases

  • 💼 Lead Generation: Find contact emails from company websites
  • 🔍 Research Automation: Collect contact information at scale
  • 📊 Competitive Analysis: Study competitor contact pages
  • 🔗 API Integration: Integrate with CRMs via webhooks
  • 📦 Batch Processing: Process thousands of URLs with scan_urls.sh
  • 🏗️ Microservices: Email discovery service for distributed architectures

๐Ÿณ Docker

Using Docker Hub Image (Production)

# Single container (no Redis persistence)
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  luisra51/gurl:latest

# With Docker Compose (includes Redis)
docker-compose -f docker-compose.hub.yml up -d

# Production with external Redis
docker run -d --name gurl-crawler \
  -p 8080:8080 \
  -e REDIS_HOST=your-redis-host \
  -e REDIS_PORT=6379 \
  -e REDIS_PERSIST_DISK=true \
  -e ASYNC_WORKERS=5 \
  -e CACHE_EXPIRATION_MONTHS=12 \
  luisra51/gurl:latest

Development (from source)

# Quick development (no persistence)
docker-compose up --build

# Fast rebuilds
docker-compose up --build crawler-app

# Clean and start fresh
docker-compose down -v && docker-compose up --build

Manual build

docker build -t email-crawler .
docker run -p 8080:8080 email-crawler

๐Ÿ” Monitoring and Debugging

# View cache statistics
curl "http://localhost:8080/cache/stats"

# View worker and job status
curl "http://localhost:8080/scan/jobs"

# Application logs
docker-compose logs -f crawler-app

# Redis logs
docker-compose logs -f redis

# Enter container for debugging
docker-compose exec crawler-app sh
