# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.

## Development workflow
- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests:
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`

## Linting and formatting
- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
- Ruff rules that frequently fail in this repo (see the sketch after this list):
  - **Import sorting (`I001`)**: Keep import blocks sorted and grouped. Running `ruff check --select I --fix <files>` repairs the order.
  - **Type-checking guards (`TCH001`)**: Place imports that are only needed for typing (e.g., `AggregateUDF`, `ScalarUDF`, `TableFunction`, `WindowUDF`, `NullTreatment`, `DataFrame`) inside an `if TYPE_CHECKING:` block.
  - **Docstring spacing (`D202`, `D205`)**: Separate the summary line from the docstring body with exactly one blank line, and leave no blank line immediately after the closing triple quotes.
  - **Ternary suggestions (`SIM108`)**: Prefer a single-line ternary expression over a multi-line `if`/`else` assignment when Ruff requests it.
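
For instance, this snippet satisfies all three of the trickier rules (illustrative only; `DataFrame` stands in for any typing-only import from this repo):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # TCH001: the import is only needed by the type checker.
    from datafusion import DataFrame


def describe(df: DataFrame) -> str:
    """Summarize a DataFrame in one word.

    D205: exactly one blank line separates the summary from this body.
    """
    # D202: no blank line between the closing quotes and this line.
    return "rows" if df.count() else "empty"  # SIM108: single-line ternary
```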

## Notes
- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.

## Rust insights

Below is a set of concise mental-model shifts and expert insights about Rust that help when developing and reviewing the Rust parts of this repository. They emphasize thinking in terms of compile-time guarantees, capabilities, and algebraic composition rather than just language ergonomics. Each insight is followed by a short, hedged sketch.
1. Ownership → Compile-Time Resource Graph

> Stop seeing ownership as “who frees memory.”
> See it as a **compile-time dataflow graph of resource control**.

Every `let`, `move`, or borrow defines an edge in a graph that the compiler statically verifies, ensuring linear usage of scarce resources (files, sockets, locks) **without a runtime GC**. Once you see lifetimes as edges, not annotations, you’re designing **proofs of safety**, not code that merely compiles.
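
A minimal sketch of that dataflow view, using only the standard library: the file moves along one edge, and the compiler rejects any later use.

```rust
use std::fs::File;
use std::io::Write;

// `file` moves into `consume`: one edge in the resource graph. After the
// call, `main` no longer owns it, and the compiler enforces that statically.
fn consume(mut file: File) -> std::io::Result<()> {
    file.write_all(b"done")?;
    Ok(()) // `file` is dropped (and the handle closed) here, with no GC
}

fn main() -> std::io::Result<()> {
    let file = File::create("/tmp/ownership_demo.txt")?;
    consume(file)?;
    // file.write_all(b"again")?; // compile error: use of moved value
    Ok(())
}
```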

2. Borrowing → Capability Leasing

> Stop thinking of borrowing as “taking a reference.”
> It’s **temporary permission to mutate or observe**, granted by the compiler’s capability system.

`&mut` isn’t just a pointer; it’s a **lease with exclusive rights**, enforced at compile time. Expert code treats borrows as contracts:

* If you can shorten them, you increase parallelism.
* If you lengthen them, you increase the safety scope.
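
For example, a sketch of a tightly scoped lease (with non-lexical lifetimes the explicit block is only for emphasis):

```rust
fn main() {
    let mut scores = vec![1, 2, 3];

    {
        let lease = &mut scores; // exclusive lease begins
        lease.push(4);           // only the lease may touch `scores` here
    }                            // lease ends: the capability is returned

    // Shared observation is permitted again once the lease has expired.
    println!("total = {}", scores.iter().sum::<i32>());
}
```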
| 51 | + |
| 52 | +3. Traits → Behavioral Algebra |
| 53 | + |
| 54 | +> Stop viewing traits as “interfaces.” |
| 55 | +> They’re **algebraic building blocks** that define composable laws of behavior. |
| 56 | +
|
| 57 | +A `Trait` isn't just a promise of methods; it’s a **contract that can be combined, derived, or blanket-implemented**. Once you realize traits form a behavioral lattice, you stop subclassing and start composing — expressing polymorphism as **capabilities, not hierarchies**. |
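
A sketch of composition over hierarchy, using a hypothetical `Describe` capability that a blanket impl grants to every `Display` type:

```rust
use std::fmt::Display;

// A capability, not a base class.
trait Describe {
    fn describe(&self) -> String;
}

// Blanket implementation: any type that can Display gains the capability,
// composing behavior instead of building an inheritance hierarchy.
impl<T: Display> Describe for T {
    fn describe(&self) -> String {
        format!("value: {self}")
    }
}

fn main() {
    println!("{}", 42.describe());      // i32 has Display, so it has Describe
    println!("{}", "hello".describe()); // &str too
}
```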
| 58 | + |
| 59 | +4. `Result` → Explicit Control Flow as Data |
| 60 | + |
| 61 | +> Stop using `Result` as an error type. |
| 62 | +> It’s **control flow reified as data**. |
| 63 | +
|
| 64 | +The `?` operator turns sequential logic into a **monadic pipeline** — your `Result` chain isn’t linear code; it’s a dependency graph of partial successes. Experts design their APIs so every recoverable branch is an **encoded decision**, not a runtime exception. |
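
A small sketch of `?` threading encoded decisions through a pipeline:

```rust
use std::num::ParseIntError;

// Each fallible step is a value, not an exception; `?` forwards the first
// `Err` out of the chain, so the happy path reads linearly.
fn parse_and_double(input: &str) -> Result<i64, ParseIntError> {
    let n: i64 = input.trim().parse()?; // recoverable branch, encoded as data
    Ok(n * 2)
}

fn main() {
    println!("{:?}", parse_and_double(" 21 ")); // Ok(42)
    println!("{:?}", parse_and_double("nope")); // Err(..), no panic, no throw
}
```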

5. Lifetimes → Static Borrow Slices

> Stop fearing lifetimes as compiler noise.
> They’re **proofs of local consistency**: mini type-level theorems.

Each `'a` parameter expresses that two pieces of data **coexist safely** within a bounded region of time. Experts deliberately model relationships through lifetime parameters to **eliminate entire classes of runtime checks**.
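
For instance, a single `'a` ties both inputs and the output into one region the compiler verifies statically:

```rust
// The returned reference is proven not to outlive either input; no runtime
// check or defensive copy is needed.
fn longer<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() >= y.len() { x } else { y }
}

fn main() {
    let a = String::from("datafusion");
    let b = String::from("rust");
    println!("{}", longer(&a, &b)); // both borrows coexist safely here
}
```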

6. Pattern Matching → Declarative Exhaustiveness

> Stop thinking of `match` as a fancy switch.
> It’s a **total function over variants**, verified at compile time.

Once you realize `match` isn’t branching but **structural enumeration**, you start writing exhaustive domain models where every possible state is named, and every transition is **type-checked**.
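
A sketch with a hypothetical `JobState`: forgetting a variant arm is a compile error, not a runtime surprise.

```rust
// Every possible state is named in the type.
enum JobState {
    Queued,
    Running { progress: u8 },
    Done,
}

// A total function over the variants: no catch-all arm, so adding a new
// variant breaks compilation until every `match` handles it.
fn describe(state: &JobState) -> String {
    match state {
        JobState::Queued => "waiting".to_string(),
        JobState::Running { progress } => format!("{progress}% complete"),
        JobState::Done => "finished".to_string(),
    }
}

fn main() {
    println!("{}", describe(&JobState::Running { progress: 40 }));
}
```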

7. `Option` → Lazy Computation Pipeline

> Stop seeing `Option` as “value or no value.”
> See it as a **lazy computation pipeline** that only executes when meaningful.

Combinators such as `map`, `and_then`, and `filter` turn error handling into data flow: once you think of absence as a first-class transformation, you can write algorithms that never mention control flow explicitly, and yet they’re 100% safe and analyzable by the compiler.
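
A sketch of absence flowing through combinators, with no explicit branch in sight:

```rust
// Each step runs only when a value is present; `None` short-circuits the
// rest of the pipeline.
fn first_even_squared(values: &[i32]) -> Option<i32> {
    values
        .iter()
        .find(|v| *v % 2 == 0)  // Option<&i32>: absence enters the pipeline
        .map(|v| v * v)         // executes only on Some
        .filter(|v| *v < 1_000) // absence as a first-class transformation
}

fn main() {
    println!("{:?}", first_even_squared(&[3, 4, 5])); // Some(16)
    println!("{:?}", first_even_squared(&[1, 3, 5])); // None
}
```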

## Refactoring opportunities
- Avoid using private or low-level APIs when a stable, public helper exists. For example,
  automated refactors should spot and replace uses like this:

  ```python
  # Before (uses a private/low-level PyArrow API)
  reader = pa.RecordBatchReader._import_from_c_capsule(
      df.__arrow_c_stream__()
  )

  # After (uses the public API)
  reader = pa.RecordBatchReader.from_stream(df)
  ```

  Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()`
  and prefer `from_stream(df)` instead. This improves readability and avoids
  relying on private PyArrow internals that may change.

## Commenting guidance

Use comments intentionally. Prefer three kinds of comments, depending on purpose:

- Implementation Comments
  - Explain non-obvious choices and tricky implementations
  - Serve as breadcrumbs for future developers

- Documentation Comments
  - Describe functions, classes, and modules
  - Act as public interface documentation

- Contextual Comments
  - Document assumptions, preconditions, and non-obvious requirements

Keep comments concise and up-to-date; prefer clear code over comments when
possible, and move long-form design notes into the repository docs or an
appropriate design file.
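
For instance, all three kinds in one illustrative, hypothetical helper:

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the simple moving average of ``values``.

    Documentation comment: the docstring documents the public interface.
    """
    # Contextual comment: assumes window >= 1; callers are expected to
    # validate user input before reaching this helper.
    averages = []
    for end in range(window, len(values) + 1):
        # Implementation comment: slicing keeps the code obvious; switch to
        # a running sum if profiling ever shows these copies matter.
        averages.append(sum(values[end - window : end]) / window)
    return averages
```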

## Helper Functions

- `python/datafusion/io.py` offers global context readers:
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions (see the sketch after this list):
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions via attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (it replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper, which fetches the first non-empty record batch and flags whether more are available.
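
A hedged sketch tying a few of these helpers together. It follows the documented datafusion-python API, but verify the exact signatures (e.g., the `udf` argument order) against the sources above:

```python
import pyarrow as pa
from datafusion import SessionContext, udf


def is_null(arr: pa.Array) -> pa.Array:
    # Runs on Arrow arrays, one batch at a time.
    return arr.is_null()


# Scalar UDF: declared input types, return type, and volatility.
is_null_udf = udf(is_null, [pa.int64()], pa.bool_(), "stable")

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, None, 3]})
df.select(is_null_udf(df["a"])).show()
```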