GML-2011 Local temp files #21
@prinskumar-tigergraph please rebase to latest main.
2888900 to 3bbe8bc
common/utils/text_extractors.py
Outdated
    for idx, doc_data in enumerate(doc_entries):
        # Use file_counter for unique naming across all files
        counter_val = next(file_counter) if file_counter else idx
        doc_filename = f"doc_{counter_val}_{doc_data.get('doc_id', 'unknown')}.json"
Given that we're already writing text and images to separate JSON files, we could consider going back to the old JSONL format and putting everything in a single JSONL file, which can be loaded directly with conn.runLoadingJobWithFile() without any extra processing. @prinskumar-tigergraph
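A minimal sketch of the single-file JSONL idea, assuming each entry in doc_entries is a JSON-serializable dict. The helper name write_jsonl and the example fields are hypothetical and not taken from this PR. The call to conn.runLoadingJobWithFile() is left out because the loading job name and file tag depend on the graph schema.

```python
import json
from pathlib import Path


def write_jsonl(doc_entries, out_path):
    """Write one JSON object per line and return the number of lines written.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for entry in doc_entries:
            # json.dumps escapes embedded newlines, so each entry stays on one line.
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Example entries; the field names are assumptions for illustration.
    entries = [
        {"doc_id": "report_1", "doc_type": "text", "content": "first page\nsecond line"},
        {"doc_id": "report_1_img_0", "doc_type": "image", "content": "caption text"},
    ]
    written = write_jsonl(entries, "ingestion_temp/batch.jsonl")
    print(f"wrote {written} lines")
```

With one file per batch, the per-document file counter in the snippet above would no longer be needed for unique naming, although a per-batch name would still have to be unique.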
…df4llm integration
aa1ce34 to 2d54b02
Closing this PR; created a separate branch.
User description
Added a local temp folder for file processing so that processing won't run out of memory.
UI changes accordingly.
PR Type
Enhancement, Documentation, Bug fix
Description
Temp storage for server ingestion review
UI to review/delete, optional direct ingest
PDF extraction via pymupdf4llm with captions
Add S3/Bedrock ingest config and docs
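The temp storage and review/delete flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names save_doc, list_temp_files, and delete_temp_file are hypothetical, the file naming mirrors the doc_{counter}_{doc_id}.json pattern from the reviewed snippet, and doc_id is assumed to contain no path separators.

```python
import itertools
import json
import tempfile
from pathlib import Path


def save_doc(temp_dir, doc, counter):
    """Write one processed document to its own JSON file and return the path."""
    # The shared counter keeps names unique across all files in a batch.
    name = f"doc_{next(counter)}_{doc.get('doc_id', 'unknown')}.json"
    path = Path(temp_dir) / name
    path.write_text(json.dumps(doc, ensure_ascii=False), encoding="utf-8")
    return path


def list_temp_files(temp_dir):
    """Return the names of staged JSON files, sorted for stable display."""
    return sorted(p.name for p in Path(temp_dir).glob("*.json"))


def delete_temp_file(temp_dir, name):
    """Delete one staged file; return True if a file was removed."""
    # Reject names containing directory components so a request
    # cannot reach outside the temp folder.
    if Path(name).name != name:
        raise ValueError(f"invalid file name: {name!r}")
    target = Path(temp_dir) / name
    if target.is_file():
        target.unlink()
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as temp_dir:
        counter = itertools.count()
        save_doc(temp_dir, {"doc_id": "a", "content": "text"}, counter)
        save_doc(temp_dir, {"doc_id": "b", "content": "more text"}, counter)
        print(list_temp_files(temp_dir))
        delete_temp_file(temp_dir, list_temp_files(temp_dir)[0])
        print(list_temp_files(temp_dir))
```

Writing each document to disk as soon as it is processed means only one document needs to be held in memory at a time, which is the motivation given in the user description.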
File Walkthrough
6 files
  Switch PDF extraction to pymupdf4llm with temp folders
  Save server-processed docs to temp JSON and ingest
  Add endpoints to list and delete ingestion temp files
  Simplify image description API; remove legacy saver
  New MarkdownProcessor for image refs and captions
  UI review workflow, direct ingest toggle, S3 Bedrock
2 files
  Add AGPL-3.0 license for pymupdf4llm dependency
  Update setup, provider configs, and prompt paths
1 file
  Add pymupdf4llm, remove direct PyMuPDF pin