Releases: Eventual-Inc/Daft
v0.6.7
What's Changed 🚀
💥 Breaking Changes
- feat!: Catch transient errors on turbopuffer writes @desmondcheongzx (#5380)
✨ Features
- feat: add viz for embedding @samster25 (#5419)
- feat!: Catch transient errors on turbopuffer writes @desmondcheongzx (#5380)
- feat(dashboard): Cleanup Queries Page @srilman (#5416)
- feat: Extend hash variants for xxhash @srilman (#5276)
- feat: prompt @colin-ho (#5394)
- feat: Add case function for better SQL-style conditional expressions @rasanpreetsingh3 (#5383)
🐛 Bug Fixes
- fix: Reduce number udfs by 1 in multi udf test @colin-ho (#5414)
- fix: Wrap azure fsspec in pafs.FSSpecHandler @colin-ho (#5412)
- fix(flotilla): Set flotilla actor cpu requests to 1 @colin-ho (#5404)
- fix: Fix prompt integration tests @colin-ho (#5401)
- fix: Fix Operator Finalization in Swordish Stat Manager @srilman (#5398)
🚀 Performance
♻️ Refactor
📖 Documentation
- docs: Fix some document errors @plotor (#5409)
- docs: update minhash example to use cc dataset @everettVT (#5390)
- docs: fix daft.File usage examples @kevinzwang (#5403)
👷 CI
- ci: Remove Tests for the Old Ray Runner @srilman (#5374)
- ci: disable running tpch profiling on push @kevinzwang (#5384)
🔧 Maintenance
- chore: bump pyo3 dependency @universalmind303 (#5410)
- chore: revert #5383 @kevinzwang (#5396)
- chore: optimize operator naming @Jay-ju (#5204)
Full Changelog: v0.6.6...v0.6.7
v0.6.6
What's Changed 🚀
💥 Breaking Changes
- docs!: update docstrings for various functions @universalmind303 (#5344)
✨ Features
- feat: Explicit AWS vs. HTTP mode for common crawl dataset @malcolmgreaves (#5379)
- feat: pydantic model type conversion @kevinzwang (#5370)
- feat(dashboard): Individual Query Page @srilman (#5367)
- feat(flotilla): Flotilla sort merge join @colin-ho (#5369)
- feat: batch UDF with `@daft.func.batch` @kevinzwang (#5362)
- feat(dashboard): Queries Page @srilman (#5257)
- feat: more tensor conversions @kevinzwang (#5357)
- feat: @daft.cls decorator for new class UDFs @kevinzwang (#5350)
- feat: Detect concurrency / num gpus for model apis @colin-ho (#5342)
- feat: Flotilla linear scheduler @colin-ho (#4378)
- feat: Lazy `from_glob_path` @colin-ho (#5235)
🐛 Bug Fixes
- fix: Use sum supertype for `list_sum` type inference @colin-ho (#5366)
- fix: Use default io config in read video if not passed in @colin-ho (#5364)
- fix: support serialize and deserialize LazyImport @stayrascal (#5361)
- fix(file): python expects bytes instead of None @universalmind303 (#5348)
- fix: read_video_frames handles EOF gracefully @rchowell (#5343)
🚀 Performance
- perf(flotilla): Throttle worker refresh and autoscaling @colin-ho (#5351)
- perf: Elide shuffle for distinct if input is already partitioned @colin-ho (#5354)
- perf: use bincode instead of python for io_conf serialization in FileArray @universalmind303 (#5340)
- perf: Only Serialize Required Cols in Process UDFs @srilman (#5069)
📖 Documentation
- docs: Update Common Crawl Dataset docs to make AWS region explicit @desmondcheongzx (#5373)
- docs: add Flotilla blog post links to AI benchmarks @ykdojo (#5359)
- docs: Revamp optimization docs @colin-ho (#5347)
- docs!: update docstrings for various functions @universalmind303 (#5344)
👷 CI
- ci: Fix ai benchmark workflow @colin-ho (#5363)
- ci: Pin pydantic version in `provision.py` for iceberg tests @colin-ho (#5352)
- ci: Add ai benchmarks ci @colin-ho (#5337)
Full Changelog: v0.6.5...v0.6.6
v0.6.5
What's Changed 🚀
💥 Breaking Changes
- refactor!: make daft.File immutable @universalmind303 (#5288)
✨ Features
- feat: add `use_process` flag for `@daft.func(...)` @universalmind303 (#5323)
- feat: Dashboard Query Subscriber @srilman (#5266)
- feat: Subscriber Framework @srilman (#5210)
- feat: make file-array serializable @universalmind303 (#5304)
- feat: add count() pushdown optimization in Iceberg datasource @huleilei (#5029)
🐛 Bug Fixes
- fix: Fix the make docs warnings @colin-ho (#5328)
- fix: Iterate on the patched [email protected] @desmondcheongzx (#5322)
- fix: decimal format for handling scientific notation @rchowell (#5303)
- fix: Use patched [email protected] for AKS Workload Identity credentials to continue working > 24 hours @desmondcheongzx (#5299)
🚀 Performance
- perf: more literal optimizations @universalmind303 (#5314)
- perf: Support parallel CSV parsing when files contain carriage returns @desmondcheongzx (#5319)
♻️ Refactor
- refactor!: make daft.File immutable @universalmind303 (#5288)
📖 Documentation
- docs: add casting matrix @kevinzwang (#5333)
- docs: add daft.func docs page and APIs @kevinzwang (#5335)
- docs: Update links for running Daft in distributed mode @desmondcheongzx (#5334)
- docs: Fix broken links on minhash example @colin-ho (#5326)
- docs: Add architecture docs @colin-ho (#5320)
- docs: Add docs to broken link checker @colin-ho (#5324)
- docs: Clean up AGENTS.md structure @ykdojo (#5321)
- docs: Add Kubernetes quickstart to Daft docs @jeevb (#5318)
- docs: Add docs and values reference for quickstart chart @jeevb (#5313)
- docs: Fix broken link in Common Crawl dataset docs @desmondcheongzx (#5301)
- docs: Document Common Crawl dataset @desmondcheongzx (#5300)
👷 CI
- ci: Only fail broken link checker on 404s @colin-ho (#5327)
- ci: fix property test column name @kevinzwang (#5325)
🔧 Maintenance
- chore: Refactor DistributedPipelineNode to implement TreeDisplay @srilman (#5315)
- chore: Enable interactive html for `df.__repr_html__` @colin-ho (#5312)
Full Changelog: v0.6.4...v0.6.5
v0.6.4
What's Changed 🚀
💥 Breaking Changes
- feat!: unify Python -> Daft type conversions @kevinzwang (#5201)
✨ Features
- feat: Add rows written stat for sinks @colin-ho (#5285)
- feat: Add Common Crawl dataset @desmondcheongzx (#5244)
- feat!: unify Python -> Daft type conversions @kevinzwang (#5201)
- feat: when function @kevinzwang (#5283)
- feat: Add first draft of k8s-quickstart helm chart @jeevb (#5272)
- feat: add support for pyiceberg 0.10.0 @gmweaver (#5277)
🐛 Bug Fixes
- fix: Pass io config when grabbing Common Crawl manifest @desmondcheongzx (#5294)
- fix: Use {} for dashboard dynamic route @colin-ho (#5289)
📖 Documentation
- docs: Update readme with new benchmarks @colin-ho (#5281)
- docs: fixed typo on s3 section of docs @destroyer22719 (#5280)
✅ Tests
- test: Temporarily remove Common Crawl integration test @desmondcheongzx (#5296)
- test: Complete coverage for array comparisons @desmondcheongzx (#5286)
🔧 Maintenance
- chore: Fix imports to not import pyarrow @colin-ho (#5290)
- chore: remove ChanChan from PR checklist @kevinzwang (#5282)
Full Changelog: v0.6.3...v0.6.4
v0.6.3
What's Changed 🚀
✨ Features
- feat: allow PythonArray to be serialized @kevinzwang (#5270)
- feat: support pyarrow.schema in write_lance api @huleilei (#5247)
- feat(dashboard): Add verbose flag @srilman (#5267)
- feat: file_size expr @universalmind303 (#5243)
- feat: Use separate runtime and port for detached dashboard @srilman (#5263)
- feat: expression avg alias @destroyer22719 (#5252)
- feat: add df constructors for daft.File @universalmind303 (#5074)
- feat: Add WARC-Target-URI as a top-level column for WARC reads @desmondcheongzx (#5254)
- feat: Force build side for broadcast joins @colin-ho (#5238)
🐛 Bug Fixes
- fix: Increase actor udf readiness timeout @colin-ho (#5258)
- fix: Ensure super extension type is registered in to_arrow_dtype @ConeyLiu (#5265)
🚀 Performance
♻️ Refactor
📖 Documentation
- docs: improve copy and consistency on index page @ykdojo (#5271)
- docs: Update AI benchmarks @colin-ho (#5264)
- docs: AI benchmarks @colin-ho (#5245)
🔧 Maintenance
- chore(deps): Upgrade aws lc rs @colin-ho (#5274)
- chore(deps): bump the minor group across 1 directory with 51 updates @dependabot[bot] (#5262)
- chore: Add warning for smj @colin-ho (#5268)
- chore(flotilla): Remove stages @srilman (#5222)
- chore: Fix more dependabot security warnings @desmondcheongzx (#5260)
- chore: Fix dependabot h11 security warning @desmondcheongzx (#5259)
- chore: remove old documentation about daft cli @universalmind303 (#5241)
⬆️ Dependencies
- chore(deps): bump the minor group across 1 directory with 51 updates @dependabot[bot] (#5262)
Full Changelog: v0.6.2...v0.6.3
v0.6.2
What's Changed 🚀
✨ Features
- feat: add File.to_tempfile method and optimize range requests @universalmind303 (#5226)
- feat: migrate nested constructors, get, slice, to_unix_epoch, partitioning functions @kevinzwang (#5233)
- feat: implement Lance filter+count pushdown optimization @huleilei (#5152)
- feat: migrate remaining basic expressions @kevinzwang (#5234)
- feat: Add regexp_count @srilman (#5191)
- feat: retry eof error in S3 multipart upload @kevinzwang (#5179)
- feat(flotilla): Distributed Pivot @srilman (#5199)
- feat: allow for functions module to be accessed from daft import @kevinzwang (#5219)
- feat: migrate first batch of daft functions @kevinzwang (#5086)
- fix: remove io_config from global scan operator @Jay-ju (#4676)
- feat: implement cross join in distributed flotilla engine @ohbh (#5180)
- feat: make multipart part size configurable @stayrascal (#5051)
🐛 Bug Fixes
- fix: write_iceberg should handle write.data.path @eric-maynard (#5153)
- fix: Cast to table schema in lance write @colin-ho (#5221)
- fix: rename index function to find @kevinzwang (#5218)
- fix: remove unnecessary field renaming in ListMap scalar UDF @ltdthanhdat (#5205)
- fix: remove io_config from global scan operator @Jay-ju (#4676)
🚀 Performance
📖 Documentation
- docs: add file datatype to urls&files docs page @universalmind303 (#5240)
- docs: add pipeline visualization and fix grammar in batch inference docs @ykdojo (#5239)
- docs: add Examples link to getting started sections @ykdojo (#5202)
- docs: fix additional incorrect references to Iceberg as a catalog @ykdojo (#5198)
- docs: clarify data catalog vs table format in README @ykdojo (#5197)
- docs: remove outdated engine comparison page @ykdojo (#5196)
- docs: Add embed_image to the user guide @desmondcheongzx (#5195)
- docs: update minhash dedupe example and tutorial @everettVT (#5192)
- docs: Add Minhash Dedupe Example @everettVT (#5165)
👷 CI
- ci: enable AI integration tests in PR @kevinzwang (#5217)
- ci: move transformers tests to integration tests @rchowell (#5186)
🔧 Maintenance
- chore: Don't log failures in url download if `on_error="null"` @colin-ho (#5231)
- chore: Make the runner a separate global singleton @srilman (#5185)
- chore: remove expr.str.regexp_split since we will immediately deprecate it @kevinzwang (#5216)
- chore: Rewrite the dashboard server to use Axum @srilman (#5212)
- chore: Remove `reset_runner` option @srilman (#5184)
- chore: Split `.str.split` into `.str.split` and `.str.regexp_split` @srilman (#5211)
- chore: enable logging configuration during launch udf worker @stayrascal (#5168)
⏪ Reverts
- revert: "fix: remove io_config from global scan operator" @universalmind303 (#5203)
Full Changelog: v0.6.1...v0.6.2
v0.6.1
What's Changed 🚀
✨ Features
- feat: expose image attribute as expression @Jay-ju (#4848)
- feat(flotilla): no shuffle for hash join if conditions are met @colin-ho (#5135)
- feat: `.list.append` Expression @srilman (#5159)
- feat: Base64 Encoding @srilman (#5158)
- feat: adds support for classify_text @rchowell (#5113)
- feat: Add Arrow IPC conversion for RecordBatches @srilman (#5143)
- feat: unnest param on @daft.func @kevinzwang (#5132)
🐛 Bug Fixes
- fix: Account for unschedulable udf actors @colin-ho (#4987)
- fix: Cleanup CLI Progress Bar Output @srilman (#5157)
- fix: flaky test test_transformers_image_embedder_other @kevinzwang (#5130)
🚀 Performance
📖 Documentation
- docs: improve text readability on examples page @ykdojo (#5182)
- docs: add TrendShift badge to README @ykdojo (#5181)
- docs: improve explode method documentation with null/empty list examples @ykdojo (#5164)
- docs: fix broken tutorial links and remove redundant file @ykdojo (#5154)
👷 CI
- ci: Reduce free disk space time @colin-ho (#5178)
- ci: re-add mac os unit tests on main @kevinzwang (#5163)
- ci: fix TPC-H benchmark workflows @kevinzwang (#5123)
- ci: Pipe unit test failures through duration aggregator @colin-ho (#5161)
- ci: remove macos from PR test suite @kevinzwang (#5142)
- ci: Aggregate test durations @colin-ho (#5129)
🔧 Maintenance
Full Changelog: v0.6.0...v0.6.1
v0.6.0
What's Changed 🚀
v0.6.0 marks the official release of our new Ray-based distributed engine, Flotilla! If you are already using the ray runner, you do not need to change anything. Setting the DAFT_RUNNER=ray environment variable, or calling daft.context.set_runner_ray() within your Python program, will use Flotilla by default.
All operations except cross join, sort merge join, and pivot are currently supported. We will be working on adding support for them soon! If you need to use the legacy ray runner, please set daft.set_execution_config(use_legacy_ray_runner=True).
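As a rough illustration of the two ways to select the ray runner described above, here is a minimal sketch. The helper names `use_ray_runner_via_env` and `use_ray_runner_in_code` are hypothetical and exist only for this example; the Daft calls are the ones named in these notes. Calling `set_execution_config` after `set_runner_ray` is an assumption, since the notes do not specify an order.

```python
import os


def use_ray_runner_via_env() -> None:
    """Select the ray runner through the environment.

    Assumption: this runs before Daft chooses a runner in the current process.
    """
    os.environ["DAFT_RUNNER"] = "ray"


def use_ray_runner_in_code(legacy: bool = False) -> None:
    """Select the ray runner from Python, optionally opting into the legacy runner.

    Hypothetical helper: Daft is imported lazily so that this file can be
    loaded on a machine without Daft or Ray installed.
    """
    import daft

    # Per the v0.6.0 notes, the ray runner uses Flotilla by default.
    daft.context.set_runner_ray()
    if legacy:
        # Flag named in the v0.6.0 notes; it may not exist in other versions.
        daft.set_execution_config(use_legacy_ray_runner=True)


# Only the environment-variable route is exercised here; it needs no Daft install.
use_ray_runner_via_env()
```

In a real program you would pick one of the two routes; `use_ray_runner_in_code(legacy=True)` is only relevant if you depend on cross join, sort merge join, or pivot in this release.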
💥 Breaking Changes
SQLCatalog was deprecated in v0.5 and is now removed, in favor of the bindings kwargs.
Before:
    catalog = SQLCatalog({"test_data": df})
    result = daft.sql("SELECT * FROM test_data", catalog=catalog)
After:
    bindings = {"test_data": df}
    result = daft.sql("SELECT * FROM test_data", **bindings)
- feat!: revert daft.func behavior on literal arguments @kevinzwang (#5087)
- revert!: "revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5084)
✨ Features
- feat(embed_text): Support LM Studio as a provider @desmondcheongzx (#5103)
- feat: Implement embed_image() @desmondcheongzx (#5101)
- feat!: revert daft.func behavior on literal arguments @kevinzwang (#5087)
- feat: Automatically grab embedding dimensions for sentence transformers @desmondcheongzx (#5078)
- feat: add mcap datasource reader @Jay-ju (#4727)
🐛 Bug Fixes
- fix: Undo skipcheck change @srilman (#5131)
- fix: fix youtube video reading @rchowell (#5126)
- fix: Remove flotilla fallback @colin-ho (#5114)
- fix: Add nulls in json reads if a line doesn't contain the field from the schema @colin-ho (#4993)
- fix: Check if UDFs are Serializable @srilman (#5091)
- fix: nightly property test @malcolmgreaves (#5076)
- fix: Handle Unserializable Errors in Process UDFs @srilman (#5075)
- fix: Implement Multi-Column Aggregations with List-like columns @srilman (#5017)
🚀 Performance
- perf: Implement count pushdown for parquet @desmondcheongzx (#5038)
- perf(flotilla): Use Worker Affinity with Pre-Shuffle Merge @srilman (#5112)
- perf: Split UDFs from Filters @srilman (#5070)
- perf(embed_text): Let Sentence Transformers select the best available device @desmondcheongzx (#5082)
♻️ Refactor
📖 Documentation
- docs: fix navigation labels to match section names @ykdojo (#5121)
- docs: fix flickering typewriter animation on overview page @ykdojo (#5118)
- docs: Add batch inference use case @desmondcheongzx (#5116)
- docs: Add docs for custom data sources and sinks @desmondcheongzx (#5115)
- docs: add dark mode support for Algolia DocSearch @ykdojo (#5109)
- docs: add noindex tag to non-stable pages @jaychia (#5105)
- docs: Add text guide @desmondcheongzx (#5102)
- docs: Improve installation instructions @desmondcheongzx (#5094)
- docs: More fixes to the overview page in light mode @desmondcheongzx (#5095)
- docs: Document write_turbopuffer in the user guide @desmondcheongzx (#5092)
👷 CI
- ci: fix test-wheels job in build-wheel.yml @kevinzwang (#5134)
- ci: Truncate the # of concurrent jobs in PR CI @srilman (#5122)
- ci: Run tests before publish @colin-ho (#5009)
- ci: Always run the `unit-tests` required check @colin-ho (#5119)
- ci: Do not skip postmerge tests @desmondcheongzx (#5096)
🔧 Maintenance
- chore: Add AGENTS.md @srilman (#5124)
- chore: Remove docs codeowners @desmondcheongzx (#5111)
- chore: Clean up write_turbopuffer guide @desmondcheongzx (#5093)
⏪ Reverts
- revert!: "revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5084)
Full Changelog: v0.5.22...v0.5.23
v0.5.22
What's Changed 🚀
💥 Breaking Changes
- refactor!: use struct datatype as daft representation of tuples @universalmind303 (#5030)
✨ Features
- feat: Add uv.lock to git @desmondcheongzx (#5065)
- feat: Add Hash Function Support for Decimal128, Time, Timestamp, Timestamptz Datatypes @Zyiqin-Miranda (#5026)
- feat: pushdown for lance scan @Jay-ju (#4710)
- feat: add lance merge_column task @Jay-ju (#5008)
- feat: Make the max parallel of scan tasks configurable for Native Runner @plotor (#5018)
- feat: basic generator udf @kevinzwang (#5036)
- feat: implements an openai provider with embed_text @rchowell (#4997)
- feat: daft.File object store support @universalmind303 (#5002)
🐛 Bug Fixes
- fix: Fix venv command for windows build @colin-ho (#5073)
- fix: add setuptools_scm to build wheel requirements @colin-ho (#5072)
- fix: Use cachebusting and range request fallback for HTTP requests to Hugging Face CDNs @desmondcheongzx (#5061)
- fix: Use async for starting and calling udf actors in flotilla @colin-ho (#5000)
- fix: Always refresh tqdm when updating total @colin-ho (#5033)
- fix: Fix docs build @desmondcheongzx (#5066)
- fix: require uv as prerequisite for development setup @ykdojo (#5059)
- fix: Add missing source command in Makefile install-docs-deps target @ykdojo (#5060)
- fix: Mermaid syntax error when enable explain analyze for Native Runner @plotor (#5052)
- fix: clean notebook output before running tests & tweak doc proc notebook @malcolmgreaves (#5055)
- fix: correct Modin query optimizer value in comparison tables @ykdojo (#4983)
- fix: skip credentialed tests if not from main @rchowell (#5048)
- fix: subprocess UDF inherits current process env @rchowell (#5047)
- fix: sql/spark read_iceberg and read_deltalake @kevinzwang (#5035)
- fix(blc): Disabled pipefail @rohitkulshreshtha (#5031)
♻️ Refactor
- refactor!: use struct datatype as daft representation of tuples @universalmind303 (#5030)
📖 Documentation
- docs: Make overview page legible for light mode @desmondcheongzx (#5067)
- docs: Move custom python code higher up in docs @desmondcheongzx (#5064)
- docs: Add better description in overview page @jaychia (#5063)
- docs: remove core_concepts.md and broken anchor link references @ykdojo (#5062)
- docs: fix formatting @rchowell (#4994)
- docs: remove runllm widget @ccmao1130 (#5056)
- docs: add reo script to docs @ccmao1130 (#5049)
- docs: fix broken UDF link due to core_concepts.md redirect @ykdojo (#5022)
- docs: fix typo "Github" --> "GitHub" @metonym (#5025)
- docs: fix `df.limit` link in quickstart.md @rockokw (#5013)
👷 CI
- ci: Don't run pr test suite on non-code changes fr @desmondcheongzx (#5057)
🔧 Maintenance
- chore: Remove deprecated APIs for 0.6 @colin-ho (#5050)
- chore: disable hugging face library progress bars @kevinzwang (#5040)
- chore: relax assertion in flaky sharding distribution test @Jay-ju (#5053)
- chore(dev): use pyproject.toml to manage the dev dependencies @xy-xin (#4849)
- chore: random the counter during creating DistributedActorPoolProject… @stayrascal (#5039)
⏪ Reverts
- revert: Temporarily revert "Remove deprecated APIs for 0.6" @desmondcheongzx (#5068)
Full Changelog: v0.5.21...v0.5.22
v0.5.21
What's Changed 🚀
✨ Features
- feat: Propagate morsel size top-down in swordfish @colin-ho (#4894)
- feat: DataFrame.write_huggingface @kevinzwang (#5015)
🐛 Bug Fixes
- fix(blc): Attempt to fix the broken link checker. @rohitkulshreshtha (#5010)
- fix: Print UDF stdout and Daft logs above the progress bar @srilman (#4861)
📖 Documentation
- docs: Add audio transcription example card @desmondcheongzx (#5020)
- docs: improve audio transcription example @universalmind303 (#4990)
- docs: Spice up the examples page @desmondcheongzx (#5019)
🔧 Maintenance
Full Changelog: v0.5.20...v0.5.21