-
Notifications
You must be signed in to change notification settings - Fork 326
refactor: Use embed text in document benchmark #5365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Summary
Refactored the document embedding benchmark to use Daft's built-in embed_text function instead of a custom UDF class. This simplifies the code by removing ~30 lines of custom embedding logic.
Key changes:
- Replaced custom
EmbedderUDF class (30 lines) withembed_text(df["chunk"], provider="sentence_transformers", model=EMBED_MODEL_ID) - Removed unused imports:
torch - Removed constants:
EMBEDDING_DIM,NUM_GPU_NODES,EMBEDDING_BATCH_SIZE(now handled internally byembed_text) - Added import:
from daft.functions.ai import embed_text
The new embed_text API automatically handles:
- GPU/CPU device selection via
SentenceTransformer's automatic device detection - Concurrency settings based on available GPUs using
get_gpu_udf_options()(detects Ray cluster GPUs) - Embedding dimensions via
AutoConfig.from_pretrained() - Model compilation and inference mode optimizations
This change aligns the benchmark with Daft's recommended patterns for AI operations and makes the code more maintainable.
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The refactoring replaces custom UDF code with Daft's built-in
embed_textAPI. The new API provides equivalent or better functionality with automatic GPU detection and concurrency management that matches the original hardcoded settings (8 GPU nodes, 1 GPU per worker). The implementation has been verified against test cases and follows project conventions. - No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| benchmarking/ai/document_embedding/daft_main.py | 5/5 | Refactored to use the new embed_text API, replacing custom Embedder UDF class with built-in function |
Sequence Diagram
sequenceDiagram
participant Benchmark as Benchmark Script
participant Daft as Daft DataFrame
participant EmbedText as embed_text()
participant Provider as SentenceTransformersProvider
participant UDF as UDF System
participant Model as SentenceTransformer Model
Benchmark->>Daft: Load & process PDFs
Daft->>Daft: Extract text, chunk documents
Benchmark->>EmbedText: embed_text(chunks, provider, model)
EmbedText->>Provider: get_text_embedder(model)
Provider->>Provider: get_gpu_udf_options()
Note over Provider: Detects Ray cluster GPUs<br/>Sets concurrency automatically
Provider->>EmbedText: Return TextEmbedderDescriptor
EmbedText->>UDF: Create UDF with GPU settings
UDF->>Model: Initialize SentenceTransformer
Note over Model: Auto device selection<br/>(CUDA/MPS/CPU)
UDF->>Model: encode(text_batches)
Model->>UDF: Return embeddings
UDF->>Daft: Add embedding column
Daft->>Benchmark: Write to Parquet
1 file reviewed, no comments
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5365 +/- ##
==========================================
- Coverage 75.27% 74.70% -0.57%
==========================================
Files 988 988
Lines 124736 125664 +928
==========================================
- Hits 93889 93879 -10
- Misses 30847 31785 +938 🚀 New features to boost your workflow:
|
Changes Made
Related Issues
Checklist
docs/mkdocs.ymlnavigation