# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.

## Development workflow

- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests:
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`
## Linting and formatting

- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
- Ruff rules that frequently fail in this repo:
  - **Import sorting (`I001`)**: Keep import blocks sorted and grouped. Running `ruff check --select I --fix <files>` repairs the order.
  - **Type-checking guards (`TCH001`)**: Place imports that are only needed for typing (e.g., `AggregateUDF`, `ScalarUDF`, `TableFunction`, `WindowUDF`, `NullTreatment`, `DataFrame`) inside an `if TYPE_CHECKING:` block.
  - **Docstring spacing (`D202`, `D205`)**: Separate the summary line from the body with exactly one blank line, and leave no blank line immediately after the closing triple quotes.
  - **Ternary suggestions (`SIM108`)**: When Ruff requests it, prefer a single-line ternary expression over a multi-line `if`/`else` assignment.

## Notes

- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.

## Rust insights

Below are concise mental-model shifts and expert insights about Rust that are helpful when developing and reviewing the Rust parts of this repository. They emphasize how to think in terms of compile-time guarantees, capabilities, and algebraic composition rather than just language ergonomics.

1. Ownership → Compile-Time Resource Graph

> Stop seeing ownership as “who frees memory.”
> See it as a **compile-time dataflow graph of resource control**.

Every `let`, `move`, or `borrow` defines an edge in a graph the compiler statically verifies — ensuring linear usage of scarce resources (files, sockets, locks) **without a runtime GC**. Once you see lifetimes as edges, not annotations, you’re designing **proofs of safety**, not code that merely compiles.
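
An illustrative sketch (generic Rust, not code from this repository): moving a `File` into a function is one such edge, and the compiler rejects any use after the move.

```rust
use std::fs::File;
use std::io::Write;

// Taking `File` by value moves ownership into this function.
fn finalize(mut log: File) {
    writeln!(log, "done").expect("write failed");
} // `log` is dropped here, closing the OS handle with no GC involved.

fn main() -> std::io::Result<()> {
    let log = File::create("run.log")?; // `log` owns the file handle.
    finalize(log); // Ownership moves along this edge of the graph.
    // finalize(log); // Compile error: use of moved value `log`.
    Ok(())
}
```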

2. Borrowing → Capability Leasing

> Stop thinking of borrowing as “taking a reference.”
> It’s **temporary permission to mutate or observe**, granted by the compiler’s capability system.

`&mut` isn’t a pointer — it’s a **lease with exclusive rights**, enforced at compile time. Expert code treats borrows as contracts:

* If you can shorten them, you increase parallelism.
* If you lengthen them, you increase safety scope.
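
A minimal sketch of the leasing model, using only the standard library: the exclusive lease ends with its scope, after which shared observation is allowed again.

```rust
fn main() {
    let mut scores = vec![10, 20, 30];

    {
        // `&mut` grants an exclusive, compile-time-checked lease.
        let lease = &mut scores;
        lease.push(40);
    } // The lease ends here, returning full rights to `scores`.

    // Shared observation is allowed once no exclusive lease is live.
    let total: i32 = scores.iter().sum();
    println!("total = {total}");
}
```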

3. Traits → Behavioral Algebra

> Stop viewing traits as “interfaces.”
> They’re **algebraic building blocks** that define composable laws of behavior.

A `Trait` isn’t just a promise of methods; it’s a **contract that can be combined, derived, or blanket-implemented**. Once you realize traits form a behavioral lattice, you stop subclassing and start composing — expressing polymorphism as **capabilities, not hierarchies**.
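
A short sketch of a blanket implementation; the `Summarize` trait here is hypothetical:

```rust
use std::fmt::Display;

// A capability, not a base class: any type can opt in.
trait Summarize {
    fn summary(&self) -> String;
}

// Blanket implementation: every `Display` type gains `Summarize` for free,
// composing one capability out of another instead of subclassing.
impl<T: Display> Summarize for T {
    fn summary(&self) -> String {
        format!("summary: {}", self)
    }
}

fn main() {
    println!("{}", 42.summary());
    println!("{}", "hello".summary());
}
```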

4. `Result` → Explicit Control Flow as Data

> Stop using `Result` as an error type.
> It’s **control flow reified as data**.

The `?` operator turns sequential logic into a **monadic pipeline** — your `Result` chain isn’t linear code; it’s a dependency graph of partial successes. Experts design their APIs so every recoverable branch is an **encoded decision**, not a runtime exception.
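
A small sketch of a `?` pipeline over a hypothetical parsing step:

```rust
use std::num::ParseIntError;

// Each step returns `Result`, so failure is a value, not an exception.
fn parse_pair(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x: i64 = a.trim().parse()?; // early-returns the Err branch as data
    let y: i64 = b.trim().parse()?;
    Ok(x + y)
}

fn main() {
    match parse_pair(" 40 ", "2") {
        Ok(sum) => println!("sum = {sum}"),
        Err(e) => eprintln!("recoverable parse failure: {e}"),
    }
}
```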

5. Lifetimes → Static Borrow Slices

> Stop fearing lifetimes as compiler noise.
> They’re **proofs of local consistency** — mini type-level theorems.

Each `'a` parameter expresses that two pieces of data **coexist safely** within a bounded region of time. Experts deliberately model relationships through lifetime parameters to **eliminate entire classes of runtime checks**.
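
An illustrative sketch: the `'a` in this signature is the proof that the returned reference cannot outlive the slice it came from, so no runtime check is needed.

```rust
// The signature is a theorem: the returned reference lives no longer
// than the input slice.
fn longest<'a>(words: &'a [String]) -> Option<&'a str> {
    words.iter().max_by_key(|w| w.len()).map(|w| w.as_str())
}

fn main() {
    let words = vec!["ownership".to_string(), "lifetimes".to_string()];
    if let Some(w) = longest(&words) {
        println!("longest: {w}");
    }
    // If `words` were dropped while `w` was still alive, compilation would fail.
}
```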

6. Pattern Matching → Declarative Exhaustiveness

> Stop thinking of `match` as a fancy switch.
> It’s a **total function over variants**, verified at compile time.

Once you realize `match` isn’t branching but **structural enumeration**, you start writing exhaustive domain models where every possible state is named, and every transition is **type-checked**.
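
A sketch over a hypothetical `QueryState` enum; adding a variant turns every non-exhaustive `match` into a compile error rather than a production bug.

```rust
// Every state of a (hypothetical) query is named...
enum QueryState {
    Planned,
    Running { partitions: usize },
    Finished(Result<u64, String>),
}

// ...and `match` must be a total function over those variants.
fn describe(state: &QueryState) -> String {
    match state {
        QueryState::Planned => "planned".to_string(),
        QueryState::Running { partitions } => format!("running on {partitions} partitions"),
        QueryState::Finished(Ok(rows)) => format!("finished: {rows} rows"),
        QueryState::Finished(Err(msg)) => format!("failed: {msg}"),
    }
}

fn main() {
    println!("{}", describe(&QueryState::Running { partitions: 8 }));
}
```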

7. `Option` → Lazy Computation Pipeline

> Stop seeing `Option` as “value or no value.”
> See it as a **lazy computation pipeline** that only executes when meaningful.

Combinators such as `map`, `and_then`, and `filter` turn error-handling into data flow: once you think of absence as a first-class transformation, you can write algorithms that never mention control flow explicitly—and yet, they’re 100% safe and analyzable by the compiler.
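
A minimal sketch of such a pipeline, with no explicit `if` or `match` on absence:

```rust
fn main() {
    let raw = ["12", "oops", "  7 "];

    // Each combinator runs only when a value is actually present:
    // absence flows through the pipeline instead of being branched on.
    let first_even: Option<i32> = raw
        .iter()
        .filter_map(|s| s.trim().parse::<i32>().ok()) // parse failures vanish
        .find(|n| n % 2 == 0);

    let label = first_even
        .map(|n| format!("first even: {n}"))
        .unwrap_or_else(|| "no even numbers".to_string());

    println!("{label}");
}
```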

## Refactoring opportunities

- Avoid using private or low-level APIs when a stable, public helper exists. For example, automated refactors should spot and replace uses:

```python
# Before (uses a private/low-level PyArrow API)
reader = pa.RecordBatchReader._import_from_c_capsule(
    df.__arrow_c_stream__()
)

# After (use the public API)
reader = pa.RecordBatchReader.from_stream(df)
```

Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()` and prefer `from_stream(df)` instead. This improves readability and avoids relying on private PyArrow internals that may change.

## Commenting guidance

Use comments intentionally. Prefer three kinds of comments depending on purpose:

- Implementation Comments
  - Explain non-obvious choices and tricky implementations
  - Serve as breadcrumbs for future developers

- Documentation Comments
  - Describe functions, classes, and modules
  - Act as public interface documentation

- Contextual Comments
  - Document assumptions, preconditions, and non-obvious requirements

Keep comments concise and up-to-date; prefer clear code over comments when possible, and move long-form design notes into the repository docs or an appropriate design file.
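
An illustrative Rust sketch (the `mean` function is hypothetical) showing all three kinds in one place:

```rust
/// Documentation comment: describes the public interface.
///
/// Returns the mean of `values`, or `None` when the slice is empty.
pub fn mean(values: &[f64]) -> Option<f64> {
    // Contextual comment: assumes the caller has already filtered out NaNs;
    // any NaN in `values` propagates into the result.
    if values.is_empty() {
        return None;
    }
    // Implementation comment: a single accumulating pass keeps this O(n)
    // with no intermediate allocation.
    let sum: f64 = values.iter().sum();
    Some(sum / values.len() as f64)
}

fn main() {
    assert_eq!(mean(&[1.0, 2.0, 3.0]), Some(2.0));
    println!("mean computed");
}
```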

## Helper Functions
- `python/datafusion/io.py` offers global context readers:
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions:
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag if more are available.
