An intelligent web crawler that gathers the latest AI/ML research papers, blog posts, and repository updates from major sources.
- Python 3.8 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/ai-ml-content-crawler.git
cd ai-ml-content-crawler

# Install dependencies
pip install -r requirements.txt

# Install the package (optional)
pip install -e .
```

```bash
# Option 1: Run as a module
python -m src

# Option 2: Use the CLI (if installed)
ai-ml-crawler

# Option 3: Run from source (adjust the path to your checkout)
PYTHONPATH=/home/kevin/ai-ml-content-crawler/src python -m ai_ml_crawler
```
Detailed guides are available in the docs directory:
- Usage Guide: Step-by-step instructions
- Security Guidelines: Security features and practices
- Architecture Overview: System design and components
- Analysis Reports: Code quality and performance reports
- 8 Active Crawlers: OpenAI, Meta, Anthropic, GitHub, ArXiv, Google Scholar, Medium, HuggingFace
- Smart Caching: 15-day TTL to avoid re-fetching unchanged content
- Content Filtering: AI/ML relevance scoring with keyword matching
- Markdown Reports: Comprehensive output with metadata and scoring
- Anti-Detection: Browser profiles, rate limiting, and request randomization
- Async Processing: Concurrent crawler execution for speed
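
To illustrate the kind of keyword-based relevance scoring described above, here is a minimal sketch. The keyword list, weights, and function names are invented for illustration; they are not the project's actual configuration or API:

```python
# Hypothetical weighted keyword list for AI/ML relevance scoring.
AI_ML_KEYWORDS = {
    "transformer": 3, "llm": 3, "fine-tuning": 2,
    "neural network": 2, "machine learning": 1, "dataset": 1,
}

def relevance_score(text):
    """Sum the weights of every AI/ML keyword found in the text."""
    lowered = text.lower()
    return sum(weight for kw, weight in AI_ML_KEYWORDS.items() if kw in lowered)

def is_relevant(text, threshold=3):
    """Keep only items whose score meets a minimum threshold."""
    return relevance_score(text) >= threshold
```

A crawler could apply `is_relevant` to each fetched title or abstract and discard items below the threshold before they reach the report.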
The crawler generates comprehensive markdown reports in the `output/` directory with:
- Executive summary and statistics
- Categorized content by source
- Relevance scores and metadata
- Publication dates and tags
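
As a rough sketch of how one source's section of such a report might be assembled (the helper name and item fields below are illustrative assumptions, not the project's actual schema):

```python
def format_report_section(source, items):
    """Render one source's items as a markdown section, highest score first.

    Each item is assumed to be a dict with 'title', 'score', 'date',
    and 'tags' keys (hypothetical field names for illustration).
    """
    lines = [f"## {source}", ""]
    for item in sorted(items, key=lambda i: i["score"], reverse=True):
        lines.append(f"- **{item['title']}** (score: {item['score']})")
        lines.append(f"  - Published: {item['date']} | Tags: {', '.join(item['tags'])}")
    return "\n".join(lines)
```

Concatenating one such section per source, prefixed with a summary block, would yield a report shaped like the one described above.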
```
ai-ml-content-crawler/
├── src/                 # Source code
│   ├── crawlers/        # Individual crawler implementations
│   ├── utils/           # Utility modules
│   └── config.py        # Configuration
├── docs/                # Documentation
├── output/              # Generated reports
├── cache/               # Request cache
└── requirements.txt     # Dependencies
```
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.