Panoptica Crawler

Overview

Panoptica is a crawler with an LLM onboard for intelligent data analysis. It is designed to automate web scraping, extract information from HTML pages, and support various parsing scenarios: extracting metadata, links, page content, certificate data, and more.

Note: Panoptica is still in active development and may not yet be robust enough for production use.

Main Components

  • Web Scraper — the core service that performs website requests, parses their content, and returns structured data.
  • Executor — a wrapper API service that proxies requests to the Web Scraper and applies additional rules (such as polite crawling, per-domain locking, integration with queues, and state and result storage in Redis).

The Web Scraper can be used as a standalone service, allowing you to interact with it directly for data collection and parsing. Alternatively, it can be paired with the Executor, which adds queue management, locking, integration with message brokers, and state storage, enabling a more flexible and scalable architecture.
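
For direct, standalone use, the snippet below is a minimal sketch of calling a locally running Web Scraper over HTTP from Go. It assumes a JSON endpoint such as POST /scrape with a url field; the actual route and payload shape are defined by the Web Scraper's handlers, so treat these names as placeholders.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical request body; the real field names are defined by the
	// Web Scraper's API and may differ.
	body, err := json.Marshal(map[string]string{"url": "https://example.com"})
	if err != nil {
		log.Fatalf("marshal: %v", err)
	}

	// Port 8080 matches the SERVER_PORT used in the development commands below.
	resp, err := http.Post("http://localhost:8080/scrape", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// The scraper returns structured data (metadata, links, page content, ...);
	// here we simply print the raw response.
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("read response: %v", err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(data))
}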


Purpose

  • Mass data collection from websites for analytics, monitoring, contact search, change detection, and other tasks.
  • Flexible scraping behavior configuration: choose between simple and headless (browser-emulating) modes, control request frequency and concurrency, and use ML to select a scraping strategy.
  • Integration with queues and message brokers (e.g., Kafka, Redis PubSub) for building distributed and fault-tolerant data collection pipelines.
  • Support for polite scraping: avoid overloading sites, prevent multiple simultaneous requests to the same domain, and respect delays between requests (a conceptual sketch follows this list).
  • Caching and result storage to speed up repeated requests and reduce load on external resources.
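
To make the polite-scraping bullet concrete, here is a short Go sketch of per-domain locking with a delay between requests. It is a conceptual illustration only, not Panoptica's actual implementation, which keeps locks and state in Redis via the Executor.

package main

import (
	"fmt"
	"sync"
	"time"
)

// domainLimiter illustrates polite scraping: at most one in-flight request
// per domain, plus a fixed delay before the domain is released again.
type domainLimiter struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
	delay time.Duration
}

func newDomainLimiter(delay time.Duration) *domainLimiter {
	return &domainLimiter{locks: make(map[string]*sync.Mutex), delay: delay}
}

func (d *domainLimiter) do(domain string, fetch func()) {
	d.mu.Lock()
	l, ok := d.locks[domain]
	if !ok {
		l = &sync.Mutex{}
		d.locks[domain] = l
	}
	d.mu.Unlock()

	l.Lock() // only one request per domain at a time
	defer l.Unlock()
	fetch()
	time.Sleep(d.delay) // respect the delay before the next request to this domain
}

func main() {
	limiter := newDomainLimiter(500 * time.Millisecond)
	pages := []struct{ domain, path string }{
		{"example.com", "/a"},
		{"example.com", "/b"},
		{"other.org", "/x"},
	}

	var wg sync.WaitGroup
	for _, p := range pages {
		wg.Add(1)
		go func(domain, path string) {
			defer wg.Done()
			limiter.do(domain, func() { fmt.Println("fetching", domain+path) })
		}(p.domain, p.path)
	}
	wg.Wait()
}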

How It Works

  • External clients send requests via HTTP or queue interface to the Executor.
  • The Executor manages queues, locks, and state, then forwards requests to the Web Scraper.
  • The Web Scraper parses the website and returns the result back to the client through the Executor (a minimal sketch of this proxying pattern follows this list).
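
The following Go sketch outlines that proxying pattern: accept a client request, apply the Executor's own rules, forward the request to the Web Scraper, and relay the result. The /tasks and /scrape routes are assumptions for illustration; the real Executor lives in internal/executor and also manages queues, locks, and Redis state.

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Address of the Web Scraper in local development (see the air commands below).
	scraperURL := "http://localhost:8080/scrape"

	// A conceptual proxy handler: the real Executor applies polite-crawling
	// rules, per-domain locks, and state updates before forwarding.
	http.HandleFunc("/tasks", func(w http.ResponseWriter, r *http.Request) {
		resp, err := http.Post(scraperURL, r.Header.Get("Content-Type"), r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// Relay the scraper's structured result back to the client.
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})

	// 8012 matches the Executor's development port used with air below.
	log.Fatal(http.ListenAndServe(":8012", nil))
}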

Development

Prerequisites

  • Install Go

  • Make sure you have the just command runner installed for running project commands

  • (Optional) Local K8s — only required if you want to test the service in an environment close to production. For most local development and testing, Kubernetes is not needed.

  • (Optional) air — a live-reloading tool for Go projects that provides hot reload of the build during local development. To use it, install air and then run:

SERVER_PORT=8080 air # starts Web Scraper for development
SERVER_PORT=8012 air -c .air.executor.toml # starts Executor for development

This will start the specified service with automatic rebuild and reload on code changes.

Deployment

To deploy the server to your Kubernetes cluster, use the following command:

just k8s-apply <service> <environment> <namespace>
  • <service> — choose one of simple, headless, or executor
  • <environment> — specify the target environment (e.g., dev, prod)
  • <namespace> — specify the Kubernetes namespace for deployment

Example:

just k8s-apply service=simple env=dev ns=default

This will deploy the Web Scraper (simple mode) to the default namespace in your Kubernetes cluster using the dev environment configuration.


Configuration

Default configuration settings for both the Web Scraper and Executor services are defined in the config.go file of each respective component.

  • For the Web Scraper, see: internal/scraper/config.go
  • For the Executor, see: internal/executor/config.go

A detailed description of all configuration options and environment variables can be found in these files; refer to them for up-to-date and comprehensive documentation on configuring each service for your needs.
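
As an illustration of the pattern such config.go files typically follow, here is a minimal Go sketch of a configuration loaded from environment variables with defaults. SERVER_PORT is taken from the development commands above; any other option names should be looked up in the linked config.go files.

package main

import (
	"fmt"
	"os"
)

// Config is a minimal illustration of the pattern: each option has a
// default that an environment variable can override.
type Config struct {
	ServerPort string
}

func Load() Config {
	return Config{ServerPort: getEnv("SERVER_PORT", "8080")}
}

func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	cfg := Load()
	fmt.Printf("listening on port %s\n", cfg.ServerPort)
}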

