Open
Labels
⬆️ high priority, 🔴 wontfix, ✨ enhancement, maintainers-only, ✨jai
Description
Currently, the dictionary's vector store is updated immediately when words are submitted through the editor, before the corresponding Pull Request is reviewed and merged. This creates a critical synchronization issue:
- Ghost Entries: Words appear in AI search results even if their PRs are rejected or closed
- Data Inconsistency: Vector store contains words that don't exist in the live dictionary
- User Confusion: Users can find words through jAI that lead to 404 pages
Current Flow (Problematic)
- User submits word via editor
- GitHub operations complete (fork, branch, commit, PR creation)
- Vector store updated immediately ⚠️
- PR may be rejected/closed
- Result: Searchable word that doesn't exist
Proposed Solution: Webhook-Based Synchronization
Implement GitHub webhook integration to sync vector store only when PRs are actually merged into the main dictionary.
Implementation Steps
Phase 1: Core Webhook Infrastructure
- Create webhook endpoint at /api/webhook/github
- Configure GitHub webhook in repository settings
  - Events: pull_request (closed)
  - Payload URL: https://yourdomain.com/api/webhook/github
  - Secret: Environment variable for security
- Add webhook signature verification for security
- Test webhook reception with dummy PRs
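The signature verification step could look roughly like the sketch below. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the X-Hub-Signature-256 header as `sha256=<hex digest>`. The function name `verifyGithubSignature` and its parameter order are assumptions made for illustration, not part of the existing codebase.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// Hypothetical helper: returns true when the X-Hub-Signature-256 header
// matches an HMAC-SHA256 of the raw (unparsed) request body.
function verifyGithubSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== "string" || !signatureHeader.startsWith("sha256=")) {
    return false;
  }
  const expected =
    "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const received = Buffer.from(signatureHeader, "utf8");
  const computed = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === computed.length && timingSafeEqual(received, computed);
}
```

The check must run against the raw body bytes, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.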
Phase 2: Word Content Extraction
- Implement content extraction logic
  - Option A: Parse from PR description (if structured)
  - Option B: API call to get merged file content
  - Option C: Store in PR metadata/labels during submission
- Parse branch naming convention to identify word actions
  - Current pattern: {action}-{word-slug} (e.g., add-machine-learning)
- Extract word title and content from merged changes
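For Option B, one possible shape is a small helper that reads the merged file through GitHub's repository contents endpoint with the raw media type. This is only a sketch: `fetchMergedFile` is a hypothetical name, the file path of a word inside the repository is not stated in this issue and must be supplied by the caller, and `fetchImpl` is an injection point assumed here so the helper can be tested without network access.

```javascript
// Hypothetical helper: fetches the raw content of one file at a given ref.
// `path` is supplied by the caller; this issue does not define where word files live.
async function fetchMergedFile({ owner, repo, path, ref, token, fetchImpl = fetch }) {
  // Encode each path segment separately so "/" separators are preserved.
  const encodedPath = path.split("/").map(encodeURIComponent).join("/");
  const url =
    `https://api.github.com/repos/${owner}/${repo}/contents/${encodedPath}` +
    `?ref=${encodeURIComponent(ref)}`;
  const response = await fetchImpl(url, {
    headers: {
      // Raw media type: the response body is the file content itself.
      Accept: "application/vnd.github.raw+json",
      Authorization: `Bearer ${token}`,
    },
  });
  if (!response.ok) {
    throw new Error(`GitHub contents request failed with status ${response.status}`);
  }
  return response.text();
}
```

Reading the file at the base branch after the merge event means the vector store receives exactly what landed in the dictionary, including any edits made during review.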
Phase 3: Queue Management System
- Implement sync queue for pending operations
  - Store: { prBranch, action, wordData, timestamp }
  - Methods: queueWordForSync(), processQueuedSync(), removeQueuedSync()
- Update word submission API to use queue instead of immediate sync
- Handle PR closure without merge (cleanup queued items)
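A minimal in-memory sketch of the queue described above, keyed by PR branch. The three method names and the stored fields come from this issue; their signatures, the `syncFn` callback, and the assumption that a branch name uniquely identifies a pending PR are illustrative guesses.

```javascript
// In-memory sync queue keyed by PR branch (assumed unique per pending PR).
const syncQueue = new Map();

// Called by the word submission API instead of syncing immediately.
function queueWordForSync(prBranch, action, wordData) {
  const entry = { prBranch, action, wordData, timestamp: Date.now() };
  syncQueue.set(prBranch, entry);
  return entry;
}

// Called when the PR is merged. The entry is dequeued only after syncFn
// resolves, so a failed vector store write leaves it queued for a retry.
async function processQueuedSync(prBranch, syncFn) {
  const entry = syncQueue.get(prBranch);
  if (!entry) return false;
  await syncFn(entry);
  syncQueue.delete(prBranch);
  return true;
}

// Called when the PR is closed without being merged.
function removeQueuedSync(prBranch) {
  return syncQueue.delete(prBranch);
}
```

A Map is lost on restart, so a production version would need the same three operations backed by persistent storage.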
Phase 4: Vector Store Integration
- Update webhook handler to process merged PRs
  - Verify PR is merged (pull_request.merged === true)
  - Confirm target repository matches main dictionary
  - Extract word data and sync to vector store
- Add error handling and retry logic for vector store operations
- Implement logging and monitoring for sync operations
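The merge and repository checks above can be isolated in a pure function, which keeps the webhook handler easy to test with mock payloads. The sketch below assumes the standard pull_request event fields (`action`, `pull_request.merged`, `pull_request.head.ref`, `repository.full_name`); the helper name `decideSyncAction` and the `sync` / `discard` / `ignore` result types are hypothetical.

```javascript
// Hypothetical pure helper: maps a pull_request webhook payload to one of
// three outcomes without touching the queue or the vector store.
function decideSyncAction(payload, expectedRepo) {
  const { action, pull_request: pr, repository } = payload;
  // Only closed events matter; opened, edited, etc. are ignored.
  if (action !== "closed" || !pr) return { type: "ignore" };
  // Reject deliveries that do not come from the main dictionary repository.
  if (!repository || repository.full_name !== expectedRepo) return { type: "ignore" };
  const prBranch = pr.head.ref;
  // merged is true only when the PR was actually merged, not merely closed.
  return pr.merged === true
    ? { type: "sync", prBranch }
    : { type: "discard", prBranch };
}
```

The handler would then process the queued entry on `sync` and remove it on `discard`, which covers both the merged and the closed-without-merge paths.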
Phase 5: Cleanup and Migration
- Create cleanup utility to remove existing ghost entries
  - Compare current dictionary with vector store contents
  - Remove orphaned entries
- Add environment controls for enabling/disabling sync
- Update documentation for webhook setup and troubleshooting
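The comparison step of the cleanup utility amounts to a set difference. The sketch below assumes that dictionary words and vector store entries share a common identifier such as the word slug; that shared key and the name `findGhostEntries` are assumptions, since this issue does not say how vector store entries are keyed.

```javascript
// Returns the vector store IDs that have no matching word in the live dictionary.
// Assumes both sides use the same identifier (for example, the word slug).
function findGhostEntries(dictionarySlugs, vectorStoreIds) {
  const live = new Set(dictionarySlugs);
  return vectorStoreIds.filter((id) => !live.has(id));
}
```

Running the comparison in a report-only mode first, and deleting only after the list has been reviewed, would reduce the risk of removing valid entries because of an identifier mismatch.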
Technical Details
Branch Naming Convention
Current pattern: {action}-{word-title-slug}
- New word: add-machine-learning
- Edit word: edit-artificial-intelligence
Webhook Payload Processing
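```javascript
// Extract from PR event
const { action, pull_request, repository } = payload;
const prBranch = pull_request.head.ref;
// match() returns null for branches outside the {action}-{word-slug} pattern,
// so fall back to an empty array; wordAction and wordSlug are then undefined
const [, wordAction, wordSlug] = prBranch.match(/^(add|edit)-(.+)$/) ?? [];
```
Queue Storage Options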
- Development: In-memory Map (current implementation)
- Production: Redis or database table for persistence
- Backup: File-based queue for reliability
Security Considerations
- Webhook signature verification using GitHub secret
- Rate limiting on webhook endpoint
- Input validation for all extracted data
- Access control for sync queue management
Benefits
- Data Integrity: Vector store only contains approved dictionary entries
- Consistency: Search results always match live dictionary
- Reliability: Automatic sync without manual intervention
- Scalability: Handles high volume of submissions
- Auditability: Clear log of all sync operations
Risks and Mitigation
| Risk | Impact | Mitigation |
|---|---|---|
| Webhook delivery failure | Missed syncs | Implement retry mechanism and monitoring |
| Content extraction failure | Incomplete syncs | Multiple extraction strategies and fallbacks |
| Vector store downtime | Sync failures | Queue persistence and retry logic |
| High PR volume | Performance issues | Batch processing and rate limiting |
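The retry mitigation listed in the table could be implemented as a generic wrapper with exponential backoff, sketched below. The name `withRetry`, the default of three retries, and the 200 ms base delay are illustrative assumptions, not values taken from this issue.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs operation, retrying on failure with delays of
// baseDelayMs, 2x, 4x, ... and rethrowing the last error once retries run out.
async function withRetry(operation, { retries = 3, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // No delay after the final attempt; fall through to the throw instead.
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

With the defaults, one call makes at most four attempts and waits 200 ms, 400 ms, and 800 ms between them. Work that still fails after the last attempt should stay in the queue so that it is not lost.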
Testing Strategy
- Unit Tests: Queue management and content extraction functions
- Integration Tests: Webhook endpoint with mock GitHub payloads
- End-to-End Tests: Full workflow from PR creation to vector store sync
- Load Tests: High volume PR processing simulation
Success Metrics
- Zero ghost entries: Vector store entries match live dictionary 100%
- Sync reliability: >99% successful sync rate for merged PRs
- Performance: Webhook processing <2s per PR
- User experience: No 404s from AI search results
Future Enhancements
- Real-time sync status in editor UI
- Bulk sync utility for historical data migration
- Sync analytics dashboard for monitoring
- Multi-environment support (staging, production)
- Webhook replay mechanism for debugging