HowYouSeeMe - Advanced ROS2 Computer Vision System

A production-ready ROS2 computer vision system featuring Kinect v2, RTABMap SLAM, and 5 AI models for real-time perception, segmentation, detection, face recognition, and emotion analysis.


🎯 Overview

HowYouSeeMe is a complete computer vision system built on ROS2 Humble, providing real-time 3D perception, object detection, segmentation, face recognition, and emotion analysis. The system integrates multiple state-of-the-art AI models with Kinect v2 RGB-D sensing and RTABMap SLAM for comprehensive spatial understanding.


🌍 Vision: World State Perception System

HowYouSeeMe is the perception foundation for an intelligent robotics ecosystem that combines:

  • Computer Vision Models: YOLO, SAM, Segmentation, VLMs
  • SLAM & Mapping: RTABMap for 3D spatial understanding
  • IMU Fusion: BlueLily integration for enhanced localization
  • World State Summarizer: Unified interface combining all active models
  • MCP Server: Model Context Protocol for LLM integration
  • Visual Memory System: Persistent object tracking and spatial memory

The goal is to create a unified world state that any LLM can query to understand the robot's environment, remember object locations, and make informed decisions based on real-time perception.
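
Since the MCP server is still on the roadmap, the sketch below is only an illustration of what an LLM-facing world-state query could look like, written with the MCP Python SDK's FastMCP helper. The tool name, response fields, and values are assumptions, not the project's actual interface.

# Hypothetical sketch: the MCP server is planned, not implemented, so the
# tool name and response schema below are illustrative assumptions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("howyouseeme-world-state")

@mcp.tool()
def get_world_state() -> dict:
    """Summarize the robot's current perception state for an LLM."""
    # A real server would aggregate live SLAM pose, detections, and
    # face/emotion results; this returns a static example payload.
    return {
        "robot_pose": {"frame": "map", "xyz": [1.2, 0.4, 0.0]},
        "objects": [{"label": "cup", "xyz": [1.5, 0.9, 0.8], "conf": 0.87}],
        "people": [{"name": "John_Doe", "emotion": "happy"}],
    }

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default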

✨ Key Features

🤖 5 AI Models - Unified Pipeline

  1. SAM2 (Segment Anything Model 2) - Real-time segmentation

    • Point, box, and everything modes
    • Streaming support up to 30 FPS
    • Optimized for 4GB GPUs (0.28GB VRAM)
  2. FastSAM - Fast segmentation with text prompts

    • Natural language descriptions
    • Multiple prompt types (point, box, text)
    • Real-time performance
  3. YOLO11 - Multi-task detection

    • Object detection
    • Instance segmentation
    • Pose estimation
    • Oriented bounding boxes (OBB)
  4. InsightFace - Face recognition & liveness

    • Face detection and recognition
    • Face database management
    • Liveness detection (anti-spoofing)
    • Age and gender estimation
  5. Emotion Detection (FER) - 7 emotions

    • Happy, Sad, Angry, Surprise, Fear, Disgust, Neutral
    • Real-time streaming
    • Multi-face support
    • Color-coded visualization

🎮 Interactive Menu System

./cv_menu.sh  # Launch interactive menu
CV Pipeline - Model Selection
========================================

Select a Model:

  1) 🎯 SAM2 - Segment Anything Model 2
  2) ⚡ FastSAM - Faster SAM with Text Prompts
  3) 🔍 YOLO11 - Detection, Pose, Segmentation, OBB
  4) 👤 InsightFace - Face Recognition & Liveness
  5) 😊 Emotion Detection - 7 Emotions (FER)
  6) 📊 [Future] Depth Anything
  7) 🧠 [Future] DINO Features

System Commands:
  8) 📋 List Available Models
  9) 🛑 Stop Active Streaming

🗺️ RTABMap SLAM Integration

  • Real-time 3D mapping and localization
  • Loop closure detection
  • RGB-D odometry
  • Point cloud generation
  • TF2 coordinate transforms
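
RTABMap exposes the estimated pose through TF2, so any node can read it. A minimal sketch, assuming the usual map and base_link frame names (verify yours with ros2 run tf2_tools view_frames):

# Minimal sketch: read the latest map -> base_link transform via TF2.
# Frame names are assumptions; check what RTABMap publishes on your setup.
import rclpy
from rclpy.node import Node
from tf2_ros import Buffer, TransformListener

class PoseReader(Node):
    def __init__(self):
        super().__init__("pose_reader")
        self.tf_buffer = Buffer()
        self.tf_listener = TransformListener(self.tf_buffer, self)
        self.create_timer(1.0, self.lookup)

    def lookup(self):
        try:
            t = self.tf_buffer.lookup_transform(
                "map", "base_link", rclpy.time.Time())
            p = t.transform.translation
            self.get_logger().info(f"pose: x={p.x:.2f} y={p.y:.2f} z={p.z:.2f}")
        except Exception as e:  # transform may not be available yet
            self.get_logger().warn(str(e))

rclpy.init()
rclpy.spin(PoseReader())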

📡 Kinect v2 Bridge

  • 14.5 FPS RGB-D streaming
  • Multiple resolutions (HD, QHD, SD)
  • CUDA-accelerated processing
  • 30+ ROS2 topics
  • Calibrated depth and color alignment
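
Any of the bridge's streams can be consumed from a plain rclpy node. A minimal sketch, assuming the conventional kinect2_bridge topic name (confirm with ros2 topic list):

# Minimal sketch: subscribe to one of the bridge's color streams.
# The topic name is an assumption based on the usual kinect2_bridge layout.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class KinectViewer(Node):
    def __init__(self):
        super().__init__("kinect_viewer")
        self.bridge = CvBridge()
        self.create_subscription(
            Image, "/kinect2/qhd/image_color", self.on_image, 10)

    def on_image(self, msg: Image):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        self.get_logger().info(f"frame {frame.shape[1]}x{frame.shape[0]}")

rclpy.init()
rclpy.spin(KinectViewer())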

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    HowYouSeeMe System                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ Kinect v2    │───▶│ kinect2      │───▶│ ROS2 Topics  │     │
│  │ RGB-D Sensor │    │ bridge       │    │ 30+ streams  │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                             │                     │            │
│                             ▼                     ▼            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ RTABMap      │◀───│ CV Pipeline  │◀───│ AI Models    │     │
│  │ SLAM         │    │ Server V2    │    │ (5 models)   │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│         │                    │                    │            │
│         ▼                    ▼                    ▼            │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐     │
│  │ 3D Map       │    │ Visualization│    │ Results      │     │
│  │ /rtabmap/... │    │ /cv_pipeline │    │ JSON + Image │     │
│  └──────────────┘    └──────────────┘    └──────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

# System Requirements
- Ubuntu 22.04 LTS
- ROS2 Humble
- NVIDIA GPU with CUDA 12.6+
- Microsoft Kinect v2
- 8GB+ RAM
- Anaconda/Miniconda

Installation

  1. Clone Repository
git clone https://github.com/AryanRai/HowYouSeeMe.git
cd HowYouSeeMe
  2. Install Dependencies
# Install Kinect v2 drivers
./install_kinect_drivers.sh

# Install ROS2 packages
cd ros2_ws
colcon build
source install/setup.bash

# Install AI models (in conda environment)
conda activate howyouseeme
./install_sam2.sh
./install_fastsam.sh
./install_yolo11.sh
./install_insightface.sh
  3. Launch System
# Full system (Kinect + SLAM + CV Pipeline + RViz)
./launch_full_system_rviz.sh

# Or just Kinect + CV Pipeline
./launch_kinect_sam2_server.sh
  4. Use Interactive Menu
./cv_menu.sh

📖 Documentation

Guides live in the docs/ directory, grouped into:

  • Quick Start Guides
  • Hardware Setup
  • SLAM & Navigation
  • AI Models
  • System Guides

🎯 Usage Examples

1. SAM2 Segmentation

# Point mode - segment object at coordinates
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'sam2:prompt_type=point,x=480,y=270'"

# Box mode - segment region
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'sam2:prompt_type=box,box=200,150,700,450'"

# Everything mode - segment all objects
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'sam2:prompt_type=everything'"

# Streaming mode
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'sam2:prompt_type=point,x=480,y=270,stream=true,duration=30,fps=5'"

2. YOLO11 Detection

# Object detection
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'yolo11:task=detect,conf=0.25'"

# Pose estimation
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'yolo11:task=pose,conf=0.25'"

# Instance segmentation
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'yolo11:task=segment,conf=0.25'"

3. Face Recognition

# Detect and recognize faces
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'insightface:mode=detect_recognize'"

# Register new person
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'insightface:mode=register,name=John_Doe'"

# Check liveness (anti-spoofing)
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'insightface:mode=liveness'"

4. Emotion Detection

# Single frame emotion detection
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'insightface:mode=emotion'"

# Stream emotions continuously
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'insightface:mode=emotion,stream=true,duration=30,fps=5'"

5. FastSAM with Text

# Segment using text description
ros2 topic pub --once /cv_pipeline/model_request std_msgs/msg/String \
    "data: 'fastsam:prompt_type=text,text=a photo of a dog'"

🔧 System Commands

Launch Scripts

# Full system with visualization
./launch_full_system_rviz.sh

# Kinect + CV Pipeline only
./launch_kinect_sam2_server.sh

# SLAM with IMU
./launch_kinect2_slam_with_imu.sh

Utility Scripts

# Interactive menu
./cv_menu.sh

# Stop all processes
./kill_all.sh

# Stop streaming
./stop_cv_streaming.sh

# Test emotion detection
./test_emotion_detection.sh

Monitoring

# View results
ros2 topic echo /cv_pipeline/results

# Watch visualization
# In RViz: Add Image display for /cv_pipeline/visualization

# Monitor performance
ros2 topic hz /cv_pipeline/results
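
Results can also be consumed in code. A minimal sketch, assuming /cv_pipeline/results carries JSON in a std_msgs/String (the architecture diagram describes results as "JSON + Image"; the exact schema depends on the active model):

# Minimal sketch: parse the results stream as JSON and log its keys.
import json
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class ResultsLogger(Node):
    def __init__(self):
        super().__init__("results_logger")
        self.create_subscription(
            String, "/cv_pipeline/results", self.on_result, 10)

    def on_result(self, msg: String):
        try:
            data = json.loads(msg.data)
            self.get_logger().info(f"result keys: {list(data)}")
        except json.JSONDecodeError:
            self.get_logger().warn("non-JSON result payload")

rclpy.init()
rclpy.spin(ResultsLogger())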

📊 Performance

Processing Times

  • SAM2 Tiny: ~0.7s per frame (0.28GB VRAM)
  • YOLO11: ~0.1-0.3s per frame
  • InsightFace: ~0.3-0.5s per frame
  • Emotion Detection: ~0.5s per frame
  • FastSAM: ~0.2-0.4s per frame

Streaming Performance

  • Recommended FPS: 2-5 for AI models
  • Kinect FPS: 14.5 (RGB-D)
  • SLAM Update Rate: 1 Hz
  • GPU Memory: 0.28-2GB depending on model

System Resources

  • RAM Usage: 4-8GB
  • GPU Memory: 2-4GB (with all models loaded)
  • CPU Usage: 30-50% (4 cores)

🛠️ Development

Project Structure

HowYouSeeMe/
├── ros2_ws/                    # ROS2 workspace
│   └── src/
│       ├── cv_pipeline/        # CV Pipeline package
│       │   └── python/         # AI model workers
│       ├── kinect2_ros2_cuda/  # Kinect bridge
│       └── bluelily_bridge/    # IMU integration
├── docs/                       # Documentation
├── BlueLily/                   # IMU firmware
├── scripts/                    # Utility scripts
├── launch_*.sh                 # Launch scripts
├── cv_menu.sh                  # Interactive menu
└── README.md                   # This file

Adding New Models

See ADD_NEW_MODEL_GUIDE.md for instructions on integrating new AI models.
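
As a rough orientation before reading the guide, workers generally follow a load/process pattern. The skeleton below is illustrative only; method names and the exact interface are assumptions, so defer to ADD_NEW_MODEL_GUIDE.md and the existing *_worker.py files:

# Illustrative skeleton only; the real worker contract is defined by the
# guide and the existing *_worker.py files. Names here are assumptions.
import numpy as np

class DepthAnythingWorker:
    """Hypothetical worker following the load/process shape of the others."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self.model = None  # lazy-load to keep startup light

    def load(self):
        # checkpoint / torch.hub loading would go here
        self.model = object()

    def process(self, frame: np.ndarray, params: dict) -> dict:
        if self.model is None:
            self.load()
        # run inference and return a JSON-serializable result
        return {"model": "depth_anything", "params": params}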

Key Components

  • cv_model_manager.py: Model loading and management
  • sam2_server_v2.py: Main CV pipeline server
  • sam2_worker.py: SAM2 model worker
  • yolo11_worker.py: YOLO11 model worker
  • insightface_worker.py: Face recognition and emotion detection
  • fastsam_worker.py: FastSAM model worker

🎓 Features in Detail

Streaming Support

All models support continuous streaming:

  • Duration: Set in seconds or -1 for continuous
  • FPS: Configurable 1-30 FPS
  • Stop Command: Instant stop without restart
  • Model Switching: Switch between models during streaming
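
Streaming is controlled through the same model:key=value,... request strings shown in the usage examples. A small helper for composing them (the format is documented above; only the helper itself is new):

# Compose request strings in the documented "model:key=value,..." format.
def build_request(model: str, stream: bool = False,
                  duration: int = 30, fps: int = 5, **params) -> str:
    if stream:
        params.update(stream="true", duration=duration, fps=fps)
    body = ",".join(f"{k}={v}" for k, v in params.items())
    return f"{model}:{body}"

# -> 'sam2:prompt_type=point,x=480,y=270,stream=true,duration=30,fps=5'
print(build_request("sam2", stream=True, prompt_type="point", x=480, y=270))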

Visualization

  • RViz Integration: Real-time visualization
  • Color-Coded Results: Different colors for different detections
  • Bounding Boxes: Object and face detection
  • Segmentation Masks: Transparent overlays
  • Emotion Colors: Color-coded emotions
  • Pose Keypoints: Human skeleton visualization

Face Database

  • Persistent Storage: Face embeddings saved to disk
  • Multiple Samples: Register multiple images per person
  • Metadata Tracking: Names, timestamps, encounter counts
  • Similarity Threshold: Configurable recognition threshold
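
To make the threshold concrete: recognition reduces to cosine similarity between a query embedding and each stored embedding, accepting the best match only above the configured threshold. The sketch below shows the mechanism only; the project's actual on-disk format and default threshold may differ:

# Illustration of threshold-gated face recognition over stored embeddings.
# The real database format used by the project may differ.
import numpy as np

def recognize(query: np.ndarray, db: dict[str, np.ndarray],
              threshold: float = 0.5) -> str | None:
    query = query / np.linalg.norm(query)
    best_name, best_sim = None, threshold
    for name, emb in db.items():
        sim = float(np.dot(query, emb / np.linalg.norm(emb)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name  # None means "unknown face"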

SLAM Features

  • 3D Mapping: Real-time point cloud generation
  • Loop Closure: Automatic map correction
  • Odometry: Visual-inertial odometry
  • Localization: 6-DOF pose estimation
  • Map Saving: Persistent map storage

🐛 Troubleshooting

Common Issues

Server not starting?

# Check if processes are running
ps aux | grep sam2_server

# Kill existing processes
./kill_all.sh

# Restart
./launch_kinect_sam2_server.sh

Models not loading?

# Activate conda environment
conda activate howyouseeme

# Reinstall models
./install_sam2.sh
./install_insightface.sh

Kinect not detected?

# Check USB connection
lsusb | grep Xbox

# Restart udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger

CUDA errors?

# Check CUDA installation
nvidia-smi

# Verify CUDA version
nvcc --version

See CV_PIPELINE_TROUBLESHOOTING.md for more solutions.

🔮 Roadmap

✅ Completed (Current Status)

  • Kinect v2 ROS2 bridge with CUDA
  • RTABMap SLAM integration
  • SAM2 segmentation (Meta SAM2)
  • FastSAM with text prompts
  • YOLO11 multi-task detection
  • InsightFace face recognition
  • Emotion detection (7 emotions via FER)
  • Interactive menu system
  • Streaming support for all models
  • RViz visualization
  • BlueLily IMU integration code
  • Coordinate frame fixes

🚧 Short Term (In Progress)

  • Fix SLAM and Kinect driver - Stability improvements
  • Test BlueLily integration - IMU fusion validation
  • IMU fusion with SLAM - Better localization, lower drift
  • Hand gesture detection - MediaPipe or custom model
  • MCP Server - Model Context Protocol for LLM integration
  • Depth + Segmentation fusion - Combine depth with masks
  • 3D world position estimation - Mark YOLO objects on SLAM map
  • Gaze detection - Eye tracking integration
  • OCR tool - Text detection and recognition

🎯 Medium Term

  • World State Summarizer - Unified interface combining all models
  • Visual Memory System - Remember object locations on SLAM map
  • Event-based checkpointing - Save frames when humans/objects detected
  • Async processing - Process past frames in background
  • Object highlighting - Highlight objects/rooms when discussing
  • Meta SAM3 - Upgrade to latest segmentation model
  • Depth Anything - Advanced depth estimation
  • DINO features - Self-supervised feature extraction

🔮 Long Term Vision

  • Fix Kinect CUDA bridge - Full GPU acceleration
  • Extensible model pipeline - Custom sequential model chains
  • Condition-based pipelines - Dynamic model activation
  • Gaussian splatting - 3D scene reconstruction
  • NVBLOX integration - Real-time 3D mapping
  • LightGlue ONNX - Feature matching
  • Multi-camera support - Sensor fusion
  • Web interface - Remote monitoring
  • Mobile app - Control and visualization

🧠 Intelligent Features

  • On-demand model loading - Only run required models
  • Always-on SLAM - Continuous mapping
  • Selective object detection - Run YOLO when needed
  • LLM-driven activation - Models triggered by natural language
  • Spatial memory queries - "Where did I see the apple?"
  • Object persistence - Track objects across frames
  • Scene understanding - Semantic room mapping

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests and documentation
  5. Submit a pull request

Development Guidelines

  • Follow PEP 8 for Python code
  • Add docstrings to all functions
  • Update documentation for new features
  • Test with real Kinect hardware
  • Ensure ROS2 compatibility

📄 License

MIT License - see LICENSE file for details.

🏗️ System Integration

BlueLily IMU Integration

HowYouSeeMe integrates with BlueLily, a high-performance flight computer and sensing platform:

  • 9-axis IMU (MPU6500) for enhanced SLAM localization
  • Real-time sensor fusion with Kinect RGB-D data
  • Reduced drift in SLAM through IMU corrections
  • ROS2 bridge for seamless data integration

See BlueLily Integration Guide for details.
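
For a quick check that IMU data is reaching ROS2, a minimal subscriber sketch; the topic name is an assumption about what bluelily_bridge publishes, so confirm it with ros2 topic list:

# Minimal sketch: log the gyro rates from the IMU stream feeding SLAM.
# The topic name is an assumption; verify with `ros2 topic list`.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Imu

class ImuLogger(Node):
    def __init__(self):
        super().__init__("imu_logger")
        self.create_subscription(Imu, "/bluelily/imu", self.on_imu, 50)

    def on_imu(self, msg: Imu):
        w = msg.angular_velocity
        self.get_logger().info(f"gyro: {w.x:.3f} {w.y:.3f} {w.z:.3f}")

rclpy.init()
rclpy.spin(ImuLogger())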

Architecture Philosophy

  1. On-Demand Processing: Models load only when needed to conserve resources
  2. Always-On SLAM: Continuous mapping for spatial awareness
  3. Selective Detection: YOLO runs based on context and requirements
  4. LLM Integration: Natural language control via MCP server
  5. Visual Memory: Persistent object tracking on SLAM map
  6. Event-Driven: Checkpoint frames when significant events occur

Future Ecosystem

  • DroidCore: Central robotics platform
  • Ally: LLM-based cognitive system
  • Comms: Multi-protocol communication layer
  • World State API: Unified perception interface

🙏 Acknowledgments

  • Meta AI - SAM2 model
  • Ultralytics - YOLO11 and FastSAM
  • InsightFace - Face recognition models
  • FER - Emotion detection
  • RTABMap - SLAM implementation
  • ROS2 Community - Robotics framework
  • NVIDIA - CUDA acceleration and NVBLOX


🌟 Star History

If you find this project useful, please consider giving it a star! ⭐


Built with ❤️ for advanced computer vision and robotics

Last Updated: November 2024
