Skip to content

Conversation

@colin-ho
Copy link
Contributor

Changes Made

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

Refactored the document embedding benchmark to use Daft's built-in embed_text function instead of a custom UDF class. This simplifies the code by removing ~30 lines of custom embedding logic.

Key changes:

  • Replaced custom Embedder UDF class (30 lines) with embed_text(df["chunk"], provider="sentence_transformers", model=EMBED_MODEL_ID)
  • Removed unused imports: torch
  • Removed constants: EMBEDDING_DIM, NUM_GPU_NODES, EMBEDDING_BATCH_SIZE (now handled internally by embed_text)
  • Added import: from daft.functions.ai import embed_text

The new embed_text API automatically handles:

  • GPU/CPU device selection via SentenceTransformer's automatic device detection
  • Concurrency settings based on available GPUs using get_gpu_udf_options() (detects Ray cluster GPUs)
  • Embedding dimensions via AutoConfig.from_pretrained()
  • Model compilation and inference mode optimizations

This change aligns the benchmark with Daft's recommended patterns for AI operations and makes the code more maintainable.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The refactoring replaces custom UDF code with Daft's built-in embed_text API. The new API provides equivalent or better functionality with automatic GPU detection and concurrency management that matches the original hardcoded settings (8 GPU nodes, 1 GPU per worker). The implementation has been verified against test cases and follows project conventions.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
benchmarking/ai/document_embedding/daft_main.py 5/5 Refactored to use the new embed_text API, replacing custom Embedder UDF class with built-in function

Sequence Diagram

sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Daft as Daft DataFrame
    participant EmbedText as embed_text()
    participant Provider as SentenceTransformersProvider
    participant UDF as UDF System
    participant Model as SentenceTransformer Model

    Benchmark->>Daft: Load & process PDFs
    Daft->>Daft: Extract text, chunk documents
    Benchmark->>EmbedText: embed_text(chunks, provider, model)
    EmbedText->>Provider: get_text_embedder(model)
    Provider->>Provider: get_gpu_udf_options()
    Note over Provider: Detects Ray cluster GPUs<br/>Sets concurrency automatically
    Provider->>EmbedText: Return TextEmbedderDescriptor
    EmbedText->>UDF: Create UDF with GPU settings
    UDF->>Model: Initialize SentenceTransformer
    Note over Model: Auto device selection<br/>(CUDA/MPS/CPU)
    UDF->>Model: encode(text_batches)
    Model->>UDF: Return embeddings
    UDF->>Daft: Add embedding column
    Daft->>Benchmark: Write to Parquet
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@codecov
Copy link

codecov bot commented Oct 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.70%. Comparing base (eb0058b) to head (130dee2).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5365      +/-   ##
==========================================
- Coverage   75.27%   74.70%   -0.57%     
==========================================
  Files         988      988              
  Lines      124736   125664     +928     
==========================================
- Hits        93889    93879      -10     
- Misses      30847    31785     +938     

see 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants