Vector Store Synchronization on New Word Addition #196

Currently, the dictionary's vector store is updated immediately when words are submitted through the editor, before the corresponding Pull Request is reviewed and merged. This creates a critical synchronization issue:

- **Ghost Entries**: Words appear in AI search results even if their PRs are rejected or closed
- **Data Inconsistency**: Vector store contains words that don't exist in the live dictionary
- **User Confusion**: Users can find words through JAI that lead to 404 pages

## Current Flow (Problematic)
1. User submits word via editor
2. GitHub operations complete (fork, branch, commit, PR creation)
3. **Vector store updated immediately** ⚠️
4. PR may be rejected/closed
5. Result: Searchable word that doesn't exist

## Proposed Solution: Webhook-Based Synchronization

Implement GitHub webhook integration to sync vector store only when PRs are actually merged into the main dictionary.

### Implementation Steps

#### Phase 1: Core Webhook Infrastructure
- [ ] **Create webhook endpoint** at `/api/webhook/github`
- [ ] **Configure GitHub webhook** in repository settings
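  - Events: `pull_request` (GitHub delivers every action for this event, so the handler must filter for `closed`)
  - Payload URL: `https://yourdomain.com/api/webhook/github`
  - Secret: stored in an environment variable, never committed
- [ ] **Add webhook signature verification** against the `X-Hub-Signature-256` header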
- [ ] **Test webhook reception** with dummy PRs
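
A minimal sketch of the signature check, assuming a Node.js runtime and that the handler has access to the raw (unparsed) request body. GitHub signs each delivery with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The helper name `verifyGithubSignature` is hypothetical.

```javascript
const { createHmac, timingSafeEqual } = require('node:crypto');

// Hypothetical helper: returns true when the signature header matches the
// HMAC-SHA256 of the raw request body computed with the shared secret.
function verifyGithubSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const expected = 'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
  const received = Buffer.from(signatureHeader, 'utf8');
  const computed = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on a length mismatch, so compare lengths first
  return received.length === computed.length && timingSafeEqual(received, computed);
}
```

The comparison has to run on the exact bytes GitHub sent, so the body should be read before any JSON parsing or re-serialization.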

#### Phase 2: Word Content Extraction
- [ ] **Implement content extraction logic**
  - Option A: Parse from PR description (if structured)
  - Option B: API call to get merged file content
  - Option C: Store in PR metadata/labels during submission
- [ ] **Parse branch naming convention** to identify word actions
  - Current pattern: `{action}-{word-slug}` (e.g., `add-machine-learning`)
- [ ] **Extract word title and content** from merged changes

#### Phase 3: Queue Management System
- [ ] **Implement sync queue** for pending operations
  - Store: `{ prBranch, action, wordData, timestamp }`
  - Methods: `queueWordForSync()`, `processQueuedSync()`, `removeQueuedSync()`
- [ ] **Update word submission API** to use queue instead of immediate sync
- [ ] **Handle PR closure without merge** (cleanup queued items)
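
The queue described above could look roughly like this for development. It uses the three method names from this phase and an in-memory `Map`. The `syncFn` callback is a hypothetical stand-in for whatever writes to the vector store, and the real signatures may differ.

```javascript
// In-memory sync queue keyed by PR branch name (development only; lost on restart).
const syncQueue = new Map();

// Called by the word submission API instead of syncing immediately.
function queueWordForSync(prBranch, action, wordData) {
  syncQueue.set(prBranch, { prBranch, action, wordData, timestamp: Date.now() });
}

// Called when the PR is merged. syncFn is a hypothetical callback that
// writes the entry to the vector store.
async function processQueuedSync(prBranch, syncFn) {
  const entry = syncQueue.get(prBranch);
  if (!entry) return false; // nothing was queued for this branch
  await syncFn(entry); // if this throws, the entry stays queued for a retry
  syncQueue.delete(prBranch);
  return true;
}

// Called when the PR is closed without merging.
function removeQueuedSync(prBranch) {
  return syncQueue.delete(prBranch);
}
```

Deleting the entry only after `syncFn` resolves means a failed vector store write leaves the item in place, at the cost of a possible duplicate write if the retry succeeds after a partial failure.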

#### Phase 4: Vector Store Integration
- [ ] **Update webhook handler** to process merged PRs
  - Verify PR is merged (`pull_request.merged === true`)
  - Confirm target repository matches main dictionary
  - Extract word data and sync to vector store
- [ ] **Add error handling and retry logic** for vector store operations
- [ ] **Implement logging and monitoring** for sync operations
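
One way to express the handler's decision logic is a pure function that classifies each `pull_request` payload before any side effects run. This is a sketch only: the repository name `owner/dictionary` and the function name `classifyPullRequestEvent` are placeholders, while `action`, `pull_request.merged`, `pull_request.head.ref`, and `repository.full_name` are fields of GitHub's `pull_request` webhook payload.

```javascript
// Hypothetical: full name of the main dictionary repository.
const MAIN_REPO = 'owner/dictionary';

// Classifies a pull_request webhook payload:
//   'sync'    -> PR was merged, push the word to the vector store
//   'discard' -> PR was closed without merging, drop any queued item
//   'ignore'  -> not a closed word PR against the main repository
function classifyPullRequestEvent(payload, mainRepo = MAIN_REPO) {
  const { action, pull_request: pr, repository } = payload;
  if (action !== 'closed' || !pr) return { kind: 'ignore' };
  if (!repository || repository.full_name !== mainRepo) return { kind: 'ignore' };

  const match = /^(add|edit)-(.+)$/.exec(pr.head.ref);
  if (!match) return { kind: 'ignore' };

  const [, wordAction, wordSlug] = match;
  const kind = pr.merged === true ? 'sync' : 'discard';
  return { kind, wordAction, wordSlug, prBranch: pr.head.ref };
}
```

Keeping the classification free of I/O makes it straightforward to cover with mock payloads in the integration tests.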

#### Phase 5: Cleanup and Migration
- [ ] **Create cleanup utility** to remove existing ghost entries
  - Compare current dictionary with vector store contents
  - Remove orphaned entries
- [ ] **Add environment controls** for enabling/disabling sync
- [ ] **Update documentation** for webhook setup and troubleshooting
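
The comparison step of the cleanup utility reduces to a set difference. The sketch below assumes, without confirmation from the current codebase, that vector store entries are keyed by the same slug as the dictionary files. `findGhostEntries` is a hypothetical name, and loading both lists is left to the caller.

```javascript
// Returns the vector store IDs that have no matching entry in the live dictionary.
// Assumes both inputs use the same slug format as identifiers.
function findGhostEntries(dictionarySlugs, vectorStoreIds) {
  const live = new Set(dictionarySlugs);
  return vectorStoreIds.filter((id) => !live.has(id));
}

// Usage with illustrative data
const ghosts = findGhostEntries(
  ['machine-learning', 'artificial-intelligence'],
  ['machine-learning', 'rejected-word', 'artificial-intelligence']
);
console.log(ghosts); // [ 'rejected-word' ]
```

Running it in a dry-run mode first (log the orphans, delete nothing) would make the initial cleanup safer.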

### Technical Details

#### Branch Naming Convention
Current pattern: `{action}-{word-title-slug}`
- New word: `add-machine-learning`
- Edit word: `edit-artificial-intelligence`
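
For illustration, a branch name following this pattern can be derived from a word title as below. The slug rule (lowercase, runs of non-alphanumeric characters collapsed to one hyphen) is an assumption inferred from the two examples, not the editor's confirmed implementation, and `toBranchName` is a hypothetical helper.

```javascript
// Builds `{action}-{word-title-slug}` from an action and a word title.
// Assumed slug rule: lowercase, non-alphanumeric runs become a single hyphen,
// leading and trailing hyphens are trimmed.
function toBranchName(action, title) {
  const slug = title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return `${action}-${slug}`;
}

console.log(toBranchName('add', 'Machine Learning')); // add-machine-learning
console.log(toBranchName('edit', 'Artificial Intelligence')); // edit-artificial-intelligence
```

Under this rule, titles containing non-ASCII characters lose those characters, so the webhook should not rely on the slug alone to recover the original title.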

#### Webhook Payload Processing
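```javascript
// Extract from PR event
const { action, pull_request, repository } = payload;
const prBranch = pull_request.head.ref;
// match() returns null for branches outside the convention; fall back to an
// empty array so wordAction and wordSlug are undefined instead of throwing
const [, wordAction, wordSlug] = prBranch.match(/^(add|edit)-(.+)$/) ?? [];
```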

#### Queue Storage Options
- **Development**: In-memory Map (current implementation)
- **Production**: Redis or database table for persistence
- **Backup**: File-based queue for reliability

#### Security Considerations
- [ ] **Webhook signature verification** using GitHub secret
- [ ] **Rate limiting** on webhook endpoint
- [ ] **Input validation** for all extracted data
- [ ] **Access control** for sync queue management

### Benefits

1. **Data Integrity**: Vector store only contains approved dictionary entries
2. **Consistency**: Search results always match live dictionary
3. **Reliability**: Automatic sync without manual intervention
4. **Scalability**: Handles high volume of submissions
5. **Auditability**: Clear log of all sync operations

### Risks and Mitigation

| Risk | Impact | Mitigation |
|------|--------|------------|
| Webhook delivery failure | Missed syncs | Implement retry mechanism and monitoring |
| Content extraction failure | Incomplete syncs | Multiple extraction strategies and fallbacks |
| Vector store downtime | Sync failures | Queue persistence and retry logic |
| High PR volume | Performance issues | Batch processing and rate limiting |
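
The retry mitigation listed for vector store downtime could be sketched as a small wrapper with exponential backoff. The defaults (3 attempts, 200 ms base delay) are illustrative rather than tuned values, and `withRetry` is a hypothetical helper.

```javascript
// Retries an async operation with exponential backoff and rethrows the
// last error once all attempts are exhausted.
async function withRetry(operation, { attempts = 3, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Note that GitHub does not redeliver failed webhook deliveries on its own, so in-process retries like this cover vector store failures only. Missed deliveries still need monitoring or manual redelivery.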

### Testing Strategy

1. **Unit Tests**: Queue management and content extraction functions
2. **Integration Tests**: Webhook endpoint with mock GitHub payloads
3. **End-to-End Tests**: Full workflow from PR creation to vector store sync
4. **Load Tests**: High volume PR processing simulation

### Success Metrics

- **Zero ghost entries**: Vector store entries match live dictionary 100%
- **Sync reliability**: >99% successful sync rate for merged PRs
- **Performance**: Webhook processing <2s per PR
- **User experience**: No 404s from AI search results

### Future Enhancements

- [ ] **Real-time sync status** in editor UI
- [ ] **Bulk sync utility** for historical data migration
- [ ] **Sync analytics dashboard** for monitoring
- [ ] **Multi-environment support** (staging, production)
- [ ] **Webhook replay mechanism** for debugging
