Vision-first • Privacy-conscious • CPU-friendly inference
Vizionary is a production-minded image & video narration system that turns webcam streams, uploaded images, and short video clips into human-friendly textual descriptions and explainable visual overlays. It’s engineered to deliver low-latency, real-world performance on modest infrastructure (2 vCPU, 16 GB RAM) while keeping developer ergonomics, reproducibility, and privacy at the core.
- What: Ultra-efficient Vision → Text service that streams contextual sentences and Grad-CAM-style overlays every ~10–12 s for live, video, and image inputs, entirely on CPU.
- Stack: Next.js + Shadcn UI + Tailwind (frontend) ● FastAPI + WebSocket streaming (backend) ● PyTorch/Timm training pipeline (model) ● Gradio demo & Hugging Face deployment utilities.
- Scale target: Optimized for CPU inference (2 vCPU, 16 GB RAM) with CPU export options (TorchScript, ONNX, quantized flows).
- Why it matters: Accessibility, surveillance automation, faster content workflows, and developer-facing explainability.
- Real-time streaming narration: Produces contextual textual descriptions from webcam or uploaded media, updating every ~10–12 seconds.
- Explainability: Grad-CAM style overlays + textual cues so you can see what the model attends to.
- CPU-friendly exports: TorchScript and ONNX exports with quantization options for efficient CPU inference.
- Modular pipeline: Config-driven training, evaluation, export, and demo — swap models or prompts without touching infra.
- Privacy-first: Explicit consent UI, opt-in telemetry, anonymized and encrypted exports, and local-first inference patterns.
- Designed for real-world deployments where cheap/low-power compute matters.
- Production practices baked in: rate-limiting, CORS, robust error handling, streaming backpressure handling for websockets.
- Developer-first demos (Gradio + local dev server) to accelerate onboarding and bug reproduction.
- Clearly separated training/inference/eval/export code paths so engineering teams can safely iterate.
Vizionary is intentionally modular. At a high level:
- Frontend — Next.js app (React) that connects via WebSocket to the backend for live streaming descriptions and shows overlay visualizations. UI built with Shadcn UI + Tailwind.
- Backend — FastAPI service that accepts image/video frames or media uploads, orchestrates inference calls, and streams text + overlay results over WebSocket. Responsible for rate-limiting, authentication (optional), and export hooks.
- Training — PyTorch-based training pipeline with config-driven hyperparameters, supports both train-from-scratch and fine-tune flows. Evaluation and Grad-CAM visualization utilities included.
- Export — Scripts to export to TorchScript and ONNX, plus tuning helpers for dynamic/static quantization.
The README intentionally omits internal model architecture details; the repository keeps configuration and safe model cards for reproducibility and sharing.
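To make the backend's streaming contract concrete, here is a minimal, illustrative sketch of what a /ws/describe endpoint can look like in FastAPI. It is not the repository's actual implementation (see backend/src/websocket.py for that); run_inference and the message fields simply mirror the API described later in this README.

```python
# Illustrative only; the real logic lives in backend/src/websocket.py.
import asyncio
from datetime import datetime, timezone

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def run_inference(media_id: str):
    # Placeholder: the real service runs the exported model and writes an overlay image.
    return "A person is standing near a desk.", 0.87, "/overlays/example.png"

@app.websocket("/ws/describe")
async def describe(websocket: WebSocket):
    await websocket.accept()
    try:
        request = await websocket.receive_json()  # {"media_id": ..., "frame_rate": ..., "lang": ...}
        while True:
            text, confidence, overlay_url = await run_inference(request["media_id"])
            await websocket.send_json({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "text": text,
                "confidence": confidence,
                "overlay_url": overlay_url,
            })
            await asyncio.sleep(10)  # ~10-12 s streaming cadence
    except WebSocketDisconnect:
        pass
```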
These commands assume you have cloned the repository and are running them from the repo root, Vizionary/.
- Node.js >= 18, npm/yarn
- Python 3.10+
- git and git-lfs (if pushing model artifacts)
- Optional: GPU + CUDA for training (Colab recommended)
```bash
git clone https://github.com/Genious07/Vizionary.git
cd Vizionary

# frontend
cd frontend
npm install
# run in dev mode (default http://localhost:9002)
npm run dev
```

Open: http://localhost:9002
Create a Python environment and install dependencies:
```bash
cd backend
python -m venv .venv
source .venv/bin/activate   # macOS / Linux
# .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Run the API server (FastAPI):

```bash
# default: http://127.0.0.1:8000
uvicorn src.main:app --reload --port 8000
```

Run the Gradio demo (optional):

```bash
python src/demo.py --model checkpoints/model_traced.pt
# or python src/demo.py --model checkpoints/model.onnx
```

The frontend expects the WebSocket URL (e.g. ws://localhost:8000/ws/describe) to be configured in .env.local.
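A minimal .env.local sketch for local development (the variable name NEXT_PUBLIC_WS_URL is an assumption; check the frontend code for the exact key it reads):

```bash
# frontend/.env.local (illustrative variable name)
NEXT_PUBLIC_WS_URL=ws://localhost:8000/ws/describe
```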
This repo is Colab-first for training. The recommended workflow is:
- Choose runtime → GPU (for training) or CPU (for inference/demo).
- Install deps (in a Colab cell):
```
!pip install -q torch torchvision timm pytorch-grad-cam gradio huggingface-hub matplotlib tqdm
```
- Prepare dataset — ImageFolder layout (data/train/<class>/*, data/val/<class>/*) or mount Google Drive.
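Loading that layout is typically a few lines of torchvision; this sketch is illustrative (the transforms and batch size are assumptions; src/datasets.py is authoritative):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative preprocessing; match whatever configs/train.yaml specifies.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_ds = datasets.ImageFolder("data/train", transform=train_tf)
val_ds = datasets.ImageFolder("data/val", transform=train_tf)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False, num_workers=2)
```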
- Train

```bash
python src/train.py --config configs/train.yaml
```

configs/train.yaml includes: dataset path, lr, optimizer, epochs, batch size, seed, and flags to toggle pretrained weights.
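A sample layout for that config (key names here are illustrative; the shipped configs/train.yaml is authoritative):

```yaml
# configs/train.yaml (illustrative key names)
data_dir: data/
pretrained: true       # toggle pretrained weights vs. train-from-scratch
epochs: 20
batch_size: 32
lr: 3.0e-4
optimizer: adamw
seed: 42
```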
- Evaluate & Visualize
```bash
python src/evaluate.py --checkpoint checkpoints/best_model.pth --image example.jpg --visualize
```

This prints the predicted label + confidence and saves a Grad-CAM overlay.
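The overlays build on the pytorch-grad-cam package; the following standalone sketch shows the general pattern (the backbone, target layer, and preprocessing are assumptions, not the repository's exact setup):

```python
import numpy as np
import timm
import torch
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from torchvision import transforms

model = timm.create_model("resnet18", pretrained=True).eval()
target_layers = [model.layer4[-1]]              # last conv block of this backbone

img = Image.open("example.jpg").convert("RGB").resize((224, 224))
rgb = np.asarray(img, dtype=np.float32) / 255.0
x = transforms.ToTensor()(img).unsqueeze(0)

pred = torch.argmax(model(x), dim=1).item()
cam = GradCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=x, targets=[ClassifierOutputTarget(pred)])[0]

overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)
Image.fromarray(overlay).save("gradcam_overlay.png")
```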
- Export for CPU
```bash
python src/export.py --checkpoint checkpoints/best_model.pth --format torchscript --out checkpoints/model_traced.pt
# or ONNX
python src/export.py --checkpoint checkpoints/best_model.pth --format onnx --out checkpoints/model.onnx
```
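Under the hood these exports follow the standard PyTorch paths; a minimal sketch of the idea (the stand-in model, input shape, and opset are assumptions; src/export.py handles checkpoint loading):

```python
from pathlib import Path

import timm
import torch

# Stand-in model; src/export.py loads the trained checkpoint instead.
model = timm.create_model("resnet18", pretrained=False, num_classes=10).eval()
example = torch.randn(1, 3, 224, 224)           # dummy input; match the training resolution
Path("checkpoints").mkdir(exist_ok=True)

# TorchScript via tracing
traced = torch.jit.trace(model, example)
traced.save("checkpoints/model_traced.pt")

# ONNX
torch.onnx.export(model, example, "checkpoints/model.onnx",
                  input_names=["input"], output_names=["logits"], opset_version=17)
```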
- Optional quantization
  - Dynamic quantization for CPU (PyTorch): quantize_dynamic in the export helpers.
  - ONNX INT8 requires a calibration set (scripts available under src/utils/quantize.py).
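Dynamic quantization in PyTorch is essentially a one-liner over the float model; a minimal sketch (the layer set is an assumption; for conv-heavy backbones it mainly affects the classifier head, so measure before relying on it):

```python
import timm
import torch

model = timm.create_model("resnet18", pretrained=False, num_classes=10).eval()

# Quantize weight-heavy Linear layers to int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace and save the quantized model for CPU serving.
traced = torch.jit.trace(quantized, torch.randn(1, 3, 224, 224))
traced.save("model_quantized.pt")
```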
```bash
huggingface-cli login
```

- Create a model repo and a Space for the demo.
- Use the push helper:

```bash
python src/push_to_hf.py --model-path checkpoints/model_traced.pt --repo your-username/vizionary-model
```
This repo contains helper scripts to automate uploads and to create a simple model card. Keep private keys and secrets out of version control.
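Such a helper typically wraps huggingface_hub; this sketch shows the general shape (the repo id and paths are placeholders, and it is not the repository's actual push_to_hf.py):

```python
from huggingface_hub import HfApi, create_repo

repo_id = "your-username/vizionary-model"        # placeholder
create_repo(repo_id, repo_type="model", exist_ok=True)

api = HfApi()
api.upload_file(path_or_fileobj="checkpoints/model_traced.pt",
                path_in_repo="model_traced.pt", repo_id=repo_id)
api.upload_file(path_or_fileobj="docs/model_card.md",
                path_in_repo="README.md", repo_id=repo_id)   # model card as repo README
```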
Add a .env file with the values below and build images for frontend/backend.
```bash
# .env.example
FRONTEND_PORT=9002
BACKEND_PORT=8000
MODEL_PATH=/srv/models/model_traced.pt
```

Example Docker commands:
```bash
# build backend image
docker build -f docker/backend.Dockerfile -t vizionary-backend:latest .
# run
docker run -p 8000:8000 -e MODEL_PATH=/models/model_traced.pt -v /local/models:/models vizionary-backend:latest
```

Cloud providers: deploy the backend on a small VM (2 vCPU, 8–16 GB RAM) or use serverless containers with a persistent disk for model files.
HTTP

- POST /api/upload — upload image/video → returns a media ID
- GET /api/status/{media_id} — returns processing status and last result

WebSocket (/ws/describe)

- Client sends: { "media_id": "...", "frame_rate": 1, "lang": "en" }
- Server streams messages of shape: { "timestamp": "...", "text": "Generated sentence...", "confidence": 0.87, "overlay_url": "https://.../overlays/123.png" }
Backpressure & reconnection: the client should reconnect with exponential backoff and provide last-seen timestamp to resume streaming from the last frame.
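The reconnect pattern looks roughly like this (sketched in Python with the websockets package for brevity; the real client is the Next.js frontend, and the last_seen field name is an assumption):

```python
import asyncio
import json

import websockets  # pip install websockets

async def stream_descriptions(uri: str, media_id: str):
    last_seen, delay = None, 1.0
    while True:
        try:
            async with websockets.connect(uri) as ws:
                # Resume from the last frame we saw (field name is illustrative).
                await ws.send(json.dumps({"media_id": media_id, "frame_rate": 1,
                                          "lang": "en", "last_seen": last_seen}))
                delay = 1.0                        # reset backoff after a successful connect
                async for raw in ws:
                    msg = json.loads(raw)
                    last_seen = msg["timestamp"]
                    print(msg["text"], msg.get("confidence"))
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30)             # exponential backoff, capped at 30 s

asyncio.run(stream_descriptions("ws://localhost:8000/ws/describe", "your-media-id"))
```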
- Reduce input resolution (e.g., 160×160) for lower latency when high accuracy is not required.
- Batch offline workloads and use single-image optimized paths for webcam inference.
- Quantize exports (dynamic for PyTorch, INT8 for ONNX with calibration) for CPU speedups.
- Use ONNX Runtime with thread affinity/OMP settings tuned for the host CPU.
- Cache outputs for repeated/slow-changing frames to avoid redundant inferences.
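The ONNX Runtime tuning mentioned above can be as small as a few session options; a minimal sketch (the input name and shape are assumptions; inspect sess.get_inputs() for the real ones):

```python
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 2                      # match the 2 vCPU target
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("checkpoints/model.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {sess.get_inputs()[0].name: x})
```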
Benchmarks (rough targets):
- Cold start time-to-first-text on 2 vCPU: ~1–3 s depending on export format and model size.
- Streaming cadence: text updates every 10–12 s by design (configurable).
- User consent: Frontend includes a consent modal for live camera access and data retention choices.
- Local-first inference: Prefer local or on-prem inference to avoid sending raw frames to third-party services.
- Anonymization & opt-out: All telemetry is opt-in and anonymized; exports can be encrypted using a user-provided key.
- Explainability limits: Grad-CAM overlays are developer debugging aids, not definitive proofs of correctness — include human-in-the-loop for critical decisions.
Responsible use disclaimer: Do not use Vizionary for safety-critical decisions without rigorous testing and human oversight.
```text
vizionary/
├── frontend/            # Next.js app (React + Shadcn UI)
├── backend/             # FastAPI app, inference orchestration
│   ├── src/
│   │   ├── main.py
│   │   ├── infer.py
│   │   ├── websocket.py
│   │   └── utils.py
├── src/                 # training portion
│   ├── train.py
│   ├── evaluate.py
│   ├── export.py
│   └── datasets.py
├── configs/
├── checkpoints/
├── demos/               # Gradio demo & notebooks
└── docs/
    └── model_card.md
```
Q: WebSocket disconnects frequently.
- A: Implement exponential backoff and reconnect with the last-seen timestamp. Ensure the server (uvicorn) timeout and proxy (nginx) timeouts are configured.
Q: Export produces slow CPU inference.
- A: Try dynamic quantization, lower resolution, or ONNX Runtime with INT8 quantization.
Q: Grad-CAM fails/errors.
- A: Ensure the selected convolutional block exists; use the auto-selection utility (src/utils/gradcam_helper.py).
Q: Accuracy is low when training from scratch.
- A: Use more data or switch to fine-tune mode with a pretrained starting point (toggle in configs/train.yaml).
- Fork → branch named feat/* or fix/* → include tests and examples → open a PR.
- Keep changes small and focused. Add or update docs/ and configs/ for any user-facing behavior changes.
- Include reproducible steps for behavioral changes and performance benchmarks when modifying inference/export code.
- Avoid committing model weights or secrets. Use Git LFS for large files if necessary.
Vizionary is released under the MIT License. See LICENSE for details.
Thanks to the open-source ecosystem: PyTorch, Timm, Grad-CAM tooling, Hugging Face, Gradio, Next.js, Tailwind, and the many contributors across these projects.
Made by Satwik Singh — privacy-first, production-minded computer vision.