Panoptica Crawler

Overview

Panoptica is a crawler with an LLM onboard for intelligent data analysis. It is designed to automate web scraping, extract information from HTML pages, and support various parsing scenarios: extracting metadata, links, page content, certificate data, and more.

Note: Panoptica is still in active development and may not yet be robust enough for production use.

Main Components

  • Web Scraper — the core service that performs website requests, parses their content, and returns structured data.
  • Executor — a wrapper API service that proxies requests to the Web Scraper and applies additional rules (such as polite crawling, per-domain locking, integration with queues, and state and result storage in Redis).

The Web Scraper can be used as a standalone service, allowing you to interact with it directly for data collection and parsing. Alternatively, it can be paired with the Executor, which adds queue management, locking, integration with message brokers, and state storage, enabling a more flexible and scalable architecture.
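
For direct, standalone use, the snippet below is a minimal sketch of calling a locally running Web Scraper over HTTP from Go. It assumes a JSON endpoint such as POST /scrape with a url field; the actual route and payload shape are defined by the Web Scraper's handlers, so treat these names as placeholders.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical request body; the real field names are defined by the
	// Web Scraper's API and may differ.
	body, err := json.Marshal(map[string]string{"url": "https://example.com"})
	if err != nil {
		log.Fatalf("marshal: %v", err)
	}

	// Port 8080 matches the SERVER_PORT used in the development commands below.
	resp, err := http.Post("http://localhost:8080/scrape", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// The scraper returns structured data (metadata, links, page content, ...);
	// here we simply print the raw response.
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("read response: %v", err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(data))
}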


Purpose

  • Mass data collection from websites for analytics, monitoring, contact search, change detection, and other tasks.
  • Flexible scraping behavior configuration: choose between simple and headless (browser-emulating) modes, control request frequency and concurrency, and use ML to select a scraping strategy.
  • Integration with queues and message brokers (e.g., Kafka, Redis PubSub) for building distributed and fault-tolerant data collection pipelines.
  • Support for polite scraping: avoid overloading sites, prevent multiple simultaneous requests to the same domain, and respect delays between requests (a conceptual sketch follows this list).
  • Caching and result storage to speed up repeated requests and reduce load on external resources.
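
To make the polite-scraping bullet concrete, here is a short Go sketch of per-domain locking with a delay between requests. It is a conceptual illustration only, not Panoptica's actual implementation, which keeps locks and state in Redis via the Executor.

package main

import (
	"fmt"
	"sync"
	"time"
)

// domainLimiter illustrates polite scraping: at most one in-flight request
// per domain, plus a fixed delay before the domain is released again.
type domainLimiter struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
	delay time.Duration
}

func newDomainLimiter(delay time.Duration) *domainLimiter {
	return &domainLimiter{locks: make(map[string]*sync.Mutex), delay: delay}
}

func (d *domainLimiter) do(domain string, fetch func()) {
	d.mu.Lock()
	l, ok := d.locks[domain]
	if !ok {
		l = &sync.Mutex{}
		d.locks[domain] = l
	}
	d.mu.Unlock()

	l.Lock() // only one request per domain at a time
	defer l.Unlock()
	fetch()
	time.Sleep(d.delay) // respect the delay before the next request to this domain
}

func main() {
	limiter := newDomainLimiter(500 * time.Millisecond)
	pages := []struct{ domain, path string }{
		{"example.com", "/a"},
		{"example.com", "/b"},
		{"other.org", "/x"},
	}

	var wg sync.WaitGroup
	for _, p := range pages {
		wg.Add(1)
		go func(domain, path string) {
			defer wg.Done()
			limiter.do(domain, func() { fmt.Println("fetching", domain+path) })
		}(p.domain, p.path)
	}
	wg.Wait()
}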

How It Works

  • External clients send requests via HTTP or queue interface to the Executor.
  • The Executor manages queues, locks, and state, then forwards requests to the Web Scraper.
  • The Web Scraper parses the website and returns the result back to the client through the Executor (a minimal sketch of this proxying pattern follows this list).
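
The following Go sketch outlines that proxying pattern: accept a client request, apply the Executor's own rules, forward the request to the Web Scraper, and relay the result. The /tasks and /scrape routes are assumptions for illustration; the real Executor lives in internal/executor and also manages queues, locks, and Redis state.

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Address of the Web Scraper in local development (see the air commands below).
	scraperURL := "http://localhost:8080/scrape"

	// A conceptual proxy handler: the real Executor applies polite-crawling
	// rules, per-domain locks, and state updates before forwarding.
	http.HandleFunc("/tasks", func(w http.ResponseWriter, r *http.Request) {
		resp, err := http.Post(scraperURL, r.Header.Get("Content-Type"), r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// Relay the scraper's structured result back to the client.
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})

	// 8012 matches the Executor's development port used with air below.
	log.Fatal(http.ListenAndServe(":8012", nil))
}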

Development

Prerequisites

  • Install Go

  • Make sure you have the just command runner installed for running project commands

  • (Optional) Local K8s — only required if you want to test the service in an environment close to production. For most local development and testing, Kubernetes is not needed.

  • (Optional) air — a live-reloading tool for Go projects that provides hot reload of the build during local development. To use it, install air and then run:

SERVER_PORT=8080 air # starts Web Scraper for development
SERVER_PORT=8012 air -c .air.executor.toml # starts Executor for development

This will start the specified service with automatic rebuild and reload on code changes.

Deployment

To deploy the server to your Kubernetes cluster, use the following command:

just k8s-apply <service> <environment> <namespace>
  • <service> — choose one of simple, headless, or executor
  • <environment> — specify the target environment (e.g., dev, prod)
  • <namespace> — specify the Kubernetes namespace for deployment

Example:

just k8s-apply service=simple env=dev ns=default

This will deploy the Web Scraper (simple mode) to the default namespace in your Kubernetes cluster using the dev environment configuration.


Configuration

Default configuration settings for both the Web Scraper and Executor services are defined in the config.go file of each respective component.

  • For the Web Scraper, see: internal/scraper/config.go
  • For the Executor, see: internal/executor/config.go

A detailed description of all configuration options and environment variables can be found in these files; refer to them for up-to-date and comprehensive documentation on configuring each service for your needs.
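
As an illustration of the pattern such config.go files typically follow, here is a minimal Go sketch of a configuration loaded from environment variables with defaults. SERVER_PORT is taken from the development commands above; any other option names should be looked up in the linked config.go files.

package main

import (
	"fmt"
	"os"
)

// Config is a minimal illustration of the pattern: each option has a
// default that an environment variable can override.
type Config struct {
	ServerPort string
}

func Load() Config {
	return Config{ServerPort: getEnv("SERVER_PORT", "8080")}
}

func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	cfg := Load()
	fmt.Printf("listening on port %s\n", cfg.ServerPort)
}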

