Skip to content

Kevin-Whoo/ai-ml-content-crawler

Repository files navigation

🤖 AI/ML Content Crawler

Python 3.8+ License: MIT Code style: black Async

An intelligent web crawler that gathers the latest AI/ML research papers, blog posts, and repository updates from major sources.

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Installation

# Clone the repository
git clone https://github.com/yourusername/ai-ml-content-crawler.git
cd ai-ml-content-crawler

# Install dependencies
pip install -r requirements.txt

# Install the package (optional)
pip install -e .

Running the Crawler

# Option 1: Run as a module
python -m src

# Option 2: Use the CLI (if installed)
ai-ml-crawler

Option 3: Run this complicated one

PYTHONPATH=/home/kevin/ai-ml-content-crawler/src python -m ai_ml_crawler

📚 Documentation

Detailed guides are available in the docs directory:

🚀 Key Features

  • 8 Active Crawlers: OpenAI, Meta, Anthropic, GitHub, ArXiv, Google Scholar, Medium, HuggingFace
  • Smart Caching: 15-day TTL for efficient performance
  • Content Filtering: AI/ML relevance scoring with keyword matching
  • Markdown Reports: Comprehensive output with metadata and scoring
  • Anti-Detection: Browser profiles, rate limiting, and request randomization
  • Async Processing: Concurrent crawler execution for speed

📑 Output

The crawler generates comprehensive markdown reports in the output/ directory with:

  • Executive summary and statistics
  • Categorized content by source
  • Relevance scores and metadata
  • Publication dates and tags

📊 Project Structure

ai-ml-content-crawler/
├── src/                  # Source code
│   ├── crawlers/        # Individual crawler implementations
│   ├── utils/           # Utility modules
│   └── config.py        # Configuration
├── docs/                 # Documentation
├── output/               # Generated reports
├── cache/                # Request cache
└── requirements.txt      # Dependencies

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

About

Intelligent web crawler for AI/ML research and content

Resources

Security policy

Stars

Watchers

Forks

Packages

No packages published

Languages