Vector Store Synchronization on New Word Addition #196

@babblebey

Currently, the dictionary's vector store is updated immediately when words are submitted through the editor, before the corresponding Pull Request is reviewed and merged. This creates a critical synchronization issue:

  • Ghost Entries: Words appear in AI search results even if their PRs are rejected or closed
  • Data Inconsistency: Vector store contains words that don't exist in the live dictionary
  • User Confusion: Users can find words through jAI search whose links lead to 404 pages

Current Flow (Problematic)

  1. User submits word via editor
  2. GitHub operations complete (fork, branch, commit, PR creation)
  3. Vector store updated immediately ⚠️
  4. PR may be rejected/closed
  5. Result: Searchable word that doesn't exist

Proposed Solution: Webhook-Based Synchronization

Implement GitHub webhook integration to sync vector store only when PRs are actually merged into the main dictionary.

Implementation Steps

Phase 1: Core Webhook Infrastructure

  • Create webhook endpoint at /api/webhook/github
  • Configure GitHub webhook in repository settings
    • Events: pull_request (handle the closed action, which fires for both merged and unmerged PRs)
    • Payload URL: https://yourdomain.com/api/webhook/github
    • Secret: Environment variable for security
  • Add webhook signature verification for security
  • Test webhook reception with dummy PRs
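
A minimal sketch of the signature verification step, assuming a Node.js runtime. GitHub signs each delivery with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as sha256=<hex digest>. The helper name verifyGitHubSignature and its parameters are hypothetical, not existing code in this repository.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: returns true when `signatureHeader` (the value of the
// X-Hub-Signature-256 header) matches the HMAC-SHA256 of the raw request body.
function verifyGitHubSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const expected = 'sha256=' + crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');

  const received = Buffer.from(signatureHeader, 'utf8');
  const computed = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === computed.length &&
    crypto.timingSafeEqual(received, computed);
}
```

The HMAC has to be computed over the raw request body exactly as received. Verifying against a re-serialized JSON object can fail even for genuine deliveries.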

Phase 2: Word Content Extraction

  • Implement content extraction logic
    • Option A: Parse from PR description (if structured)
    • Option B: API call to get merged file content
    • Option C: Store in PR metadata/labels during submission
  • Parse branch naming convention to identify word actions
    • Current pattern: {action}-{word-slug} (e.g., add-machine-learning)
  • Extract word title and content from merged changes

Phase 3: Queue Management System

  • Implement sync queue for pending operations
    • Store: { prBranch, action, wordData, timestamp }
    • Methods: queueWordForSync(), processQueuedSync(), removeQueuedSync()
  • Update word submission API to use queue instead of immediate sync
  • Handle PR closure without merge (cleanup queued items)
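
A rough sketch of the queue described above, using an in-memory Map. The three method names come from this issue, but their parameters and return values are assumptions. It also assumes that a branch name uniquely identifies an open PR, which may not hold if two forks use the same branch name.

```javascript
// In-memory sync queue keyed by PR branch name. Entries are lost on restart.
const syncQueue = new Map();

// Called at submission time instead of writing to the vector store directly.
function queueWordForSync(prBranch, action, wordData) {
  syncQueue.set(prBranch, { prBranch, action, wordData, timestamp: Date.now() });
}

// Removes and returns the queued entry for a merged PR, or null if none exists.
function processQueuedSync(prBranch) {
  const entry = syncQueue.get(prBranch) ?? null;
  if (entry) syncQueue.delete(prBranch);
  return entry;
}

// Drops a queued entry when its PR is closed without merging.
// Returns true if an entry was removed.
function removeQueuedSync(prBranch) {
  return syncQueue.delete(prBranch);
}
```

The webhook handler would call processQueuedSync() for merged PRs and removeQueuedSync() for PRs closed without a merge, so rejected words never reach the vector store.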

Phase 4: Vector Store Integration

  • Update webhook handler to process merged PRs
    • Verify PR is merged (pull_request.merged === true)
    • Confirm target repository matches main dictionary
    • Extract word data and sync to vector store
  • Add error handling and retry logic for vector store operations
  • Implement logging and monitoring for sync operations
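
The merge and repository checks above could be reduced to one decision function, sketched here. The fields action, pull_request.merged, and repository.full_name are part of GitHub's pull_request webhook payload. The function name, the return values, and the MAIN_REPO placeholder are assumptions for illustration.

```javascript
// Hypothetical placeholder; replace with the main dictionary's "owner/repo".
const MAIN_REPO = 'example-org/dictionary';

// Decides what a pull_request webhook payload means for the vector store:
//   'sync'    - the PR was merged into the main repository
//   'cleanup' - the PR was closed without merging
//   'ignore'  - any other action, or a different repository
function classifyPullRequestEvent(payload, mainRepo = MAIN_REPO) {
  const { action, pull_request: pr, repository } = payload ?? {};
  if (action !== 'closed' || !pr) return 'ignore';
  if (repository?.full_name !== mainRepo) return 'ignore';
  return pr.merged === true ? 'sync' : 'cleanup';
}
```

Keeping this decision in a pure function makes it straightforward to unit test with mock payloads before wiring it to the queue and the vector store.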

Phase 5: Cleanup and Migration

  • Create cleanup utility to remove existing ghost entries
    • Compare current dictionary with vector store contents
    • Remove orphaned entries
  • Add environment controls for enabling/disabling sync
  • Update documentation for webhook setup and troubleshooting
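
The comparison step of the cleanup utility might look like the sketch below. It assumes vector store entries are identified by the same slug as dictionary words, which this issue does not confirm. Fetching both lists and deleting the orphans are left out.

```javascript
// Returns the vector store IDs that have no matching word in the live dictionary.
// Assumes both lists use the same slug format as identifiers.
function findOrphanedEntries(dictionarySlugs, vectorStoreIds) {
  const live = new Set(dictionarySlugs);
  return vectorStoreIds.filter((id) => !live.has(id));
}

const orphans = findOrphanedEntries(
  ['machine-learning', 'artificial-intelligence'],
  ['machine-learning', 'ghost-word'],
);
console.log(orphans); // → [ 'ghost-word' ]
```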

Technical Details

Branch Naming Convention

Current pattern: {action}-{word-title-slug}

  • New word: add-machine-learning
  • Edit word: edit-artificial-intelligence

Webhook Payload Processing

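// Extract from the pull_request webhook event (inside the handler function)
const { action, pull_request, repository } = payload;
const prBranch = pull_request.head.ref;
// match() returns null for branches that don't follow the naming convention
const match = prBranch.match(/^(add|edit)-(.+)$/);
if (!match) return; // not a word submission branch; nothing to sync
const [, wordAction, wordSlug] = match;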

Queue Storage Options

  • Development: In-memory Map (current implementation)
  • Production: Redis or database table for persistence
  • Backup: File-based queue for reliability

Security Considerations

  • Webhook signature verification using GitHub secret
  • Rate limiting on webhook endpoint
  • Input validation for all extracted data
  • Access control for sync queue management

Benefits

  1. Data Integrity: Vector store only contains approved dictionary entries
  2. Consistency: Search results always match live dictionary
  3. Reliability: Automatic sync without manual intervention
  4. Scalability: Handles high volume of submissions
  5. Auditability: Clear log of all sync operations

Risks and Mitigation

Risk | Impact | Mitigation
Webhook delivery failure | Missed syncs | Implement retry mechanism and monitoring
Content extraction failure | Incomplete syncs | Multiple extraction strategies and fallbacks
Vector store downtime | Sync failures | Queue persistence and retry logic
High PR volume | Performance issues | Batch processing and rate limiting
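
The retry mitigation in the table above could be sketched as a generic wrapper with exponential backoff. The names withRetry and backoffDelayMs, the default of 3 attempts, and the 500 ms base delay are all assumptions, not values this issue specifies.

```javascript
// Delay before the next try after the given attempt (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt, baseDelayMs = 500) {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Runs `operation` until it resolves or `maxAttempts` is reached.
// Rethrows the last error so the caller can keep the item queued.
async function withRetry(operation, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delay = backoffDelayMs(attempt, baseDelayMs);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A vector store write would be passed in as the operation, for example withRetry(() => addToVectorStore(word)), where addToVectorStore stands in for whatever the actual sync call is.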

Testing Strategy

  1. Unit Tests: Queue management and content extraction functions
  2. Integration Tests: Webhook endpoint with mock GitHub payloads
  3. End-to-End Tests: Full workflow from PR creation to vector store sync
  4. Load Tests: High volume PR processing simulation

Success Metrics

  • Zero ghost entries: Vector store entries match live dictionary 100%
  • Sync reliability: >99% successful sync rate for merged PRs
  • Performance: Webhook processing <2s per PR
  • User experience: No 404s from AI search results

Future Enhancements

  • Real-time sync status in editor UI
  • Bulk sync utility for historical data migration
  • Sync analytics dashboard for monitoring
  • Multi-environment support (staging, production)
  • Webhook replay mechanism for debugging

Metadata

Labels

  • ⬆️ high priority: This issue needs to be addressed like yesterday
  • 🔴 wontfix: This will not be worked on for now
  • ✨ enhancement: New feature or request or improvement
  • maintainers-only: Only a maintainer can work on this
  • ✨jai: Issues, PRs or questions related to the ✨jAI module
