URL Crawler API

The crawler API retrieves webpage titles for backends when direct requests fail, and extracts full page content for the AI team, ensuring reliable access to page metadata and text.

Features

  • Retrieve a single webpage title via GET /title/{url}
  • Retrieve multiple webpage titles via POST /titles
  • Extract webpage content via GET /content/{url}
  • Asynchronous processing for efficient request handling

Installation

Ensure you have Python 3.8+ installed, then install the required dependencies:

pip install fastapi nest_asyncio pydantic uvicorn

Running the Application

To run the application, use the following command:

python main.py
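
If you are building main.py yourself, a minimal sketch might look like the following. This is an assumption about how the service could be wired together, not the repository's actual code; nest_asyncio is applied so nested event loops (e.g. a crawler's own loop) can run inside uvicorn's loop.

    import nest_asyncio
    import uvicorn
    from fastapi import FastAPI

    nest_asyncio.apply()  # allow nested event loops while uvicorn's loop is running

    app = FastAPI()

    @app.get("/")
    async def root():
        return {"message": "Hello, this is the root of the API. Try /title/{url} or /content/{url}"}

    if __name__ == "__main__":
        uvicorn.run(app, host="127.0.0.1", port=8000)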

API Endpoints

Root Endpoint

  • URL: /
  • Method: GET
  • Description: Returns a welcome message.
  • Response:
    {
      "message": "Hello, this is the root of the API. Try /title/{url} or /content/{url}"
    }

Load Title

  • URL: /title/{url:path}
  • Method: GET
  • Description: Crawls the specified URL and returns its page title for the backend (see the sketch after this section).
  • Path Parameters:
    • url (string): The URL to crawl.
  • Example Request:
    curl -X GET "http://127.0.0.1:8000/title/https://example.com"
  • Response:
    {
      "title": "The title of the page"
    }
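
A possible shape for this route is sketched below. The fetch_title helper, its use of urllib, and the regex-based <title> extraction are illustrative assumptions, not the repository's actual crawler code.

    import asyncio
    import re
    import urllib.request

    from fastapi import FastAPI

    app = FastAPI()

    async def fetch_title(url: str) -> str:
        # Download the page on a worker thread and pull out the <title> tag.
        def _get() -> str:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="ignore")

        html = await asyncio.get_running_loop().run_in_executor(None, _get)
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else ""

    @app.get("/title/{url:path}")
    async def load_title(url: str):
        # {url:path} lets the full target URL, slashes included, appear in the path.
        return {"title": await fetch_title(url)}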

Load Titles

  • URL: /titles
  • Method: POST
  • Description: Crawls multiple URLs and returns their page titles for the backend (see the sketch after this section).
  • Request Body:
    • urls (list of strings): A list of URLs to crawl.
  • Example Request:
    {
      "urls": ["https://example.com", "https://example.org"]
    }
  • Example cURL Request:
    curl -X POST "http://127.0.0.1:8000/titles" -H "Content-Type: application/json" -d '{"urls":["https://example.com","https://example.org"]}'
  • Response:
    [
      {
        "url": "https://example.com",
        "title": "The title of the page"
      },
      {
        "url": "https://example.org",
        "title": "Another page title"
      }
    ]
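
The request body maps naturally onto a small Pydantic model, and the URLs can be crawled concurrently. The sketch below extends the previous one (reusing the hypothetical app and fetch_title) and is an assumption, not the repository's actual code.

    import asyncio
    from typing import List

    from pydantic import BaseModel

    class URLList(BaseModel):
        urls: List[str]  # the "urls" field of the request body

    @app.post("/titles")
    async def load_titles(body: URLList):
        # Crawl every URL concurrently, then pair each URL with its title.
        titles = await asyncio.gather(*(fetch_title(u) for u in body.urls))
        return [{"url": u, "title": t} for u, t in zip(body.urls, titles)]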

Load Content

  • URL: /content/{url:path}
  • Method: GET
  • Description: Crawls a webpage and returns its (filtered) content in Markdown format for the AI team (see the client example after this section).
  • Path Parameters:
    • url (string): The URL to crawl.
  • Example Request:
    curl -X GET "http://127.0.0.1:8000/content/https://example.com"
  • Response:
    {
      "url": "https://example.com",
      "content": "# content...."
    }
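
From Python, the content endpoint can be called with the standard library alone. This client snippet simply mirrors the curl example above and assumes the server is running locally on port 8000.

    import json
    import urllib.request

    BASE = "http://127.0.0.1:8000"
    target = "https://example.com"

    # The target URL goes straight into the request path, exactly as in the curl examples.
    with urllib.request.urlopen(f"{BASE}/content/{target}") as resp:
        data = json.loads(resp.read())

    print(data["url"])
    print(data["content"][:200])  # first 200 characters of the Markdown content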
