An intelligent web crawler that gathers the latest AI/ML research papers, blog posts, and repository updates from major sources.
- Python 3.8 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/ai-ml-content-crawler.git
cd ai-ml-content-crawler

# Install dependencies
pip install -r requirements.txt

# Install the package (optional)
pip install -e .
```

```bash
# Option 1: Run as a module
python -m src

# Option 2: Use the CLI (if installed)
ai-ml-crawler

# Option 3: Run from source (adjust the path to your checkout)
PYTHONPATH=/home/kevin/ai-ml-content-crawler/src python -m ai_ml_crawler
```
Detailed guides are available in the docs directory:
- Usage Guide: Step-by-step instructions
- Security Guidelines: Security features and practices
- Architecture Overview: System design and components
- Analysis Reports: Code quality and performance reports
- 8 Active Crawlers: OpenAI, Meta, Anthropic, GitHub, ArXiv, Google Scholar, Medium, HuggingFace
- Smart Caching: 15-day TTL to avoid re-fetching unchanged content
- Content Filtering: AI/ML relevance scoring with keyword matching
- Markdown Reports: Comprehensive output with metadata and scoring
- Anti-Detection: Browser profiles, rate limiting, and request randomization
- Async Processing: Concurrent crawler execution for speed
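
To illustrate the kind of keyword-based relevance scoring described above, here is a minimal sketch. The keyword list, weights, and function names are invented for illustration; they are not the project's actual configuration or API:

```python
# Hypothetical weighted keyword list for AI/ML relevance scoring.
AI_ML_KEYWORDS = {
    "transformer": 3, "llm": 3, "fine-tuning": 2,
    "neural network": 2, "machine learning": 1, "dataset": 1,
}

def relevance_score(text):
    """Sum the weights of every AI/ML keyword found in the text."""
    lowered = text.lower()
    return sum(weight for kw, weight in AI_ML_KEYWORDS.items() if kw in lowered)

def is_relevant(text, threshold=3):
    """Keep only items whose score meets a minimum threshold."""
    return relevance_score(text) >= threshold
```

A crawler could apply `is_relevant` to each fetched title or abstract and discard items below the threshold before they reach the report.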
The crawler generates comprehensive markdown reports in the `output/` directory with:
- Executive summary and statistics
- Categorized content by source
- Relevance scores and metadata
- Publication dates and tags
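
As a rough sketch of how one source's section of such a report might be assembled (the helper name and item fields below are illustrative assumptions, not the project's actual schema):

```python
def format_report_section(source, items):
    """Render one source's items as a markdown section, highest score first.

    Each item is assumed to be a dict with 'title', 'score', 'date',
    and 'tags' keys (hypothetical field names for illustration).
    """
    lines = [f"## {source}", ""]
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        lines.append(f"- **{item['title']}** (score: {item['score']})")
        lines.append(f"  - Published: {item['date']} | Tags: {', '.join(item['tags'])}")
    return "\n".join(lines)
```

Concatenating one such section per source, prefixed with a summary block, would yield a report shaped like the one described above.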
```
ai-ml-content-crawler/
├── src/                 # Source code
│   ├── crawlers/        # Individual crawler implementations
│   ├── utils/           # Utility modules
│   └── config.py        # Configuration
├── docs/                # Documentation
├── output/              # Generated reports
├── cache/               # Request cache
└── requirements.txt     # Dependencies
```
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.