refactor: Use embed text in document benchmark #5365

colin-ho · 2025-10-10T17:36:34Z

Changes Made

Related Issues

Checklist

Documented in API Docs (if applicable)
Documented in User Guide (if applicable)
If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
Documentation builds and is formatted properly

greptile-apps

Greptile Overview

Summary

Refactored the document embedding benchmark to use Daft's built-in embed_text function instead of a custom UDF class. This simplifies the code by removing ~30 lines of custom embedding logic.

Key changes:

Replaced custom Embedder UDF class (30 lines) with embed_text(df["chunk"], provider="sentence_transformers", model=EMBED_MODEL_ID)
Removed unused imports: torch
Removed constants: EMBEDDING_DIM, NUM_GPU_NODES, EMBEDDING_BATCH_SIZE (now handled internally by embed_text)
Added import: from daft.functions.ai import embed_text

The new embed_text API automatically handles:

GPU/CPU device selection via SentenceTransformer's automatic device detection
Concurrency settings based on available GPUs using get_gpu_udf_options() (detects Ray cluster GPUs)
Embedding dimensions via AutoConfig.from_pretrained()
Model compilation and inference mode optimizations

This change aligns the benchmark with Daft's recommended patterns for AI operations and makes the code more maintainable.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The refactoring replaces custom UDF code with Daft's built-in embed_text API. The new API provides equivalent or better functionality with automatic GPU detection and concurrency management that matches the original hardcoded settings (8 GPU nodes, 1 GPU per worker). The implementation has been verified against test cases and follows project conventions.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
benchmarking/ai/document_embedding/daft_main.py	5/5	Refactored to use the new `embed_text` API, replacing custom Embedder UDF class with built-in function

Sequence Diagram

sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Daft as Daft DataFrame
    participant EmbedText as embed_text()
    participant Provider as SentenceTransformersProvider
    participant UDF as UDF System
    participant Model as SentenceTransformer Model

    Benchmark->>Daft: Load & process PDFs
    Daft->>Daft: Extract text, chunk documents
    Benchmark->>EmbedText: embed_text(chunks, provider, model)
    EmbedText->>Provider: get_text_embedder(model)
    Provider->>Provider: get_gpu_udf_options()
    Note over Provider: Detects Ray cluster GPUs<br/>Sets concurrency automatically
    Provider->>EmbedText: Return TextEmbedderDescriptor
    EmbedText->>UDF: Create UDF with GPU settings
    UDF->>Model: Initialize SentenceTransformer
    Note over Model: Auto device selection<br/>(CUDA/MPS/CPU)
    UDF->>Model: encode(text_batches)
    Model->>UDF: Return embeddings
    UDF->>Daft: Add embedding column
    Daft->>Benchmark: Write to Parquet

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

codecov · 2025-10-10T18:22:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.70%. Comparing base (eb0058b) to head (130dee2).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5365      +/-   ##
==========================================
- Coverage   75.27%   74.70%   -0.57%     
==========================================
  Files         988      988              
  Lines      124736   125664     +928     
==========================================
- Hits        93889    93879      -10     
- Misses      30847    31785     +938

see 21 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

use embed text

f389ed4

github-actions bot added the refactor label Oct 10, 2025

greptile-apps bot reviewed Oct 10, 2025

View reviewed changes

Merge branch main into colin/use-embed-text-in-document-benchmark

fa08393

colin-ho added 3 commits October 10, 2025 16:02

Merge branch 'main' into colin/use-embed-text-in-document-benchmark

130dee2

use daft func

fccd0ae

fix

6b863f0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor: Use embed text in document benchmark #5365

refactor: Use embed text in document benchmark #5365

Uh oh!

colin-ho commented Oct 10, 2025

Uh oh!

greptile-apps bot left a comment

Uh oh!

codecov bot commented Oct 10, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

refactor: Use embed text in document benchmark #5365

Are you sure you want to change the base?

refactor: Use embed text in document benchmark #5365

Uh oh!

Conversation

colin-ho commented Oct 10, 2025

Changes Made

Related Issues

Checklist

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

codecov bot commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Oct 10, 2025 •

edited

Loading