
Releases: RumbleDB/rumble

RumbleDB 2.0.0 Lemon Ironwood

28 Aug 11:33
3a82682


Major release:

  • Support for the JSONiq Update Facility to write to tables managed in the Hive metastore and Delta files
  • Support for the JSONiq Scripting Extension (variable assignments, while loops, applying updates during execution, exit returning, etc)
  • Support for Python with the pip jsoniq package
  • Alpha support for XML (XQuery 3.0)
  • Automatic schema detection upon writing CSV or Parquet files. No need to specify schemas explicitly any more.
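
As a small illustration of the new scripting support, here is a hedged sketch of a variable assignment and a while loop; the concrete query is invented and the syntax follows our reading of the JSONiq Scripting Extension:

```jsoniq
(: sketch of JSONiq scripting: a variable declaration statement,
   a while loop, and assignment statements :)
variable $i := 1;
variable $sum := 0;
while ($i le 10) {
  $sum := $sum + $i;
  $i := $i + 1;
}
$sum  (: 55 :)
```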

Support for Spark 4.0 and Spark 3.5 (Scala 2.13). Note that Amazon EMR does not yet support Spark 4.0 but we expect this to happen soon. EMR 7 should be used with RumbleDB 1.22 because it is on Spark 3.5 and Scala 2.12.

Java 17 or 21 is required for Spark 4.0. Java 11 or 17 is required for Spark 3.5.

Many bug fixes, enhanced schema detection.

Contributors (Ghislain Fourny's students at ETH): Stefan Irimescu, Renato Marroquin, Rodrigo Bruno, Falko Noé, Ioana Stefan, Andrea Rinaldi, Stevan Mihajlovic, Mario Arduini, Can Berker Çıkış, Elwin Stephan, David Dao, Zirun Wang, Ingo Müller, Dan-Ovidiu Graur, Thomas Zhou, Olivier Goerens, Alexandru Meterez, Pierre Motard, Remo Röthlisberger, Dominik Bruggisser, David Loughlin, David Buzatu, Marco Schöb, Maciej Byczko, Abishek Ramdas, Matteo Agnoletto, Dwij Dixit.

Main website: https://www.rumbledb.org
Documentation: https://docs.rumbledb.org
Maven repository: https://central.sonatype.com/artifact/com.github.rumbledb/rumbledb
Javadoc: https://rumbledb.org/docs/latest/api/
Python package: https://pypi.org/project/jsoniq/

RumbleDB 1.23.0 "Mountain ash" beta

26 Mar 14:33
9fcf2e0


Update (July 3, 2025): Spark 4.0 support is available.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Supported versions
RumbleDB 1.23 supports Spark 3.5 with Scala 2.13 as well as Spark 4.0.
The jars are compatible with Java 11 and 17. Because our efforts are increasingly focused on the Spark 4 release and on stability and conformance improvements, and because Spark 4 is based on Scala 2.13, RumbleDB 1.23 drops support for Spark 3.4 and for Scala 2.12. If you use Spark 3.4, or Spark 3.5 with Scala 2.12, please use RumbleDB 1.22, which is stable.

The standalone jar contains Spark 3.5 with Scala 2.13 and thus works out of the box.

General

  • Dropped support for Scala 2.12.
  • Dropped support for Spark 3.4
  • Renamed json-file() to json-lines(); the old name can still be used for now but is marked as deprecated
  • Added support for single quotes '. Strings delimited with single quotes may contain double quotes ", but single quotes inside them must be escaped as \'. Analogously, strings delimited with double quotes may contain single quotes, but double quotes inside them must be escaped as \"
  • Added support for some popular features of the pandas/NumPy libraries
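
To illustrate the new quoting rules (the strings themselves are arbitrary examples):

```jsoniq
(: single-quoted strings may contain " freely, but an inner ' must be
   escaped; double-quoted strings are the mirror image :)
let $a := 'She said "hello", it\'s fine'
let $b := "She said \"hello\", it's fine"
return $a eq $b  (: true :)
```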

JSONiq 3.1

Added an option to use JSONiq 3.1, which changes the JSONiq 1.0 specification to align it more closely with XQuery 3.1. Enabling the option results in the following changes:

  • Objects and arrays no longer have an effective boolean value; requesting one raises an error
  • Keys in object constructors must be quoted
  • atomic is replaced by anyAtomicType
  • Error code JNDY0003 is removed and replaced by XQDY0137
  • Both the JSONiq and XQuery parsers are available. The parser to use can be selected on the command line or with a language declaration in the query file.
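
For example, under the JSONiq 3.1 option object keys must be quoted, and taking the effective boolean value of an object raises an error (a hedged sketch, with invented contents):

```jsoniq
(: valid in both modes: quoted keys :)
{ "name": "Ada", "scores": [ 1, 2, 3 ] }
(: JSONiq 1.0 also accepted { name: "Ada" }; under JSONiq 3.1 the unquoted
   form is rejected, and boolean({ "a": 1 }) raises an error instead of
   returning true :)
```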

Basic XML/XQuery support for both parsers

  • Add doc() function for reading an XML document
  • Add a new xml-files() function that allows for reading and processing of multiple .xml files in parallel
  • Add XPath steps for navigating XML documents. We are able to navigate through 32+ GB of XML data spread over many documents in just a few minutes on an Amazon EMR cluster.
  • Add data() function for atomization of nodes
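
A hedged sketch of how these functions combine; the path and element names are hypothetical:

```jsoniq
(: read many XML files in parallel and navigate them with XPath steps,
   atomizing the result with data() :)
for $d in xml-files("hdfs:///data/articles/*.xml")
return data($d/article/title)
```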

Experimental XQuery Parser

Updated option to use XQuery parser instead of JSONiq. To use it, just prefix your query with xquery version "3.1";. Note: this is in a very early state and many features are still missing.

  • Context item is "." as opposed to "$$" from JSONiq
  • No JSONiq ObjectLookups with "."
  • No JSONiq ArrayLookup and ArrayUnboxing
  • Support for XQuery Map constructor and curly Array constructor
  • Support for String Lookup on Maps and Integer lookup on arrays with the ? operator
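
An illustrative query for the XQuery parser (the map contents are invented):

```xquery
xquery version "3.1";
(: "." is the context item (instead of JSONiq's "$$"); ? performs string
   lookup on maps and integer lookup on arrays :)
let $m := map { "names": [ "Ada", "Alan" ] }
return (
  $m?names?2,           (: integer lookup on an array: "Alan" :)
  (1, 2, 3) ! (. * 10)  (: context item in a simple map: 10 20 30 :)
)
```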

Minor Improvements and Bug fixes

  • subsequence and sequence lookups now use Spark pagination for large positions
  • Rumble shell now keeps history of previous sessions
  • Implements compare() with arities 2 and 3
  • Implements trace() arity 2
  • Implements xs:numeric
  • Adds support for setting base-uri in query and as CLI option
  • Implements FOAR0002, FOAY0001, FOTY0013, FODT0001, FODT0002, XPTY0018, XPTY0019, XQST0032
  • Increase decimal multiplication precision to 18 digits
  • Fixes index lookup throwing an error for indices >= 1'000'000, as well as incorrect behaviour with non-integer indices
  • Fixes calling parallelize on an already parallelized structure throwing an error
  • Fixes index lookup with decimal not adhering to spec
  • Fixes an unnecessary warning
  • Fixes effective boolean value of NaN and decimals equal to 0
  • Fixes stringToCodepoints() on multibyte ranges
  • Fixes indexof() incorrectly finding NaN
  • Fixes some base64 errors
  • Fixes some edge cases in pow, log10, exp10, atan
  • Fixes resolveUri with empty baseUri
  • Fixes some incorrect exceptions of matches()
  • Fixes sum() with zeroElement not behaving correctly if sequence is non-empty
  • Fixes idiv and imult handling of inf and NaN
  • Fixes inner focus sometimes missing in simpleMap
  • Fixes bug allowing missing commas between function arguments

RumbleDB 1.22.0 "Pyrenean oak" beta

24 Oct 14:18
90a7faa


Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Supported Java versions

The jars are compatible with Java 11. Support for Java 8 is dropped.

Supported Spark versions

Spark 3.2 and 3.3 are no longer supported as of RumbleDB 1.22, as they are no longer supported officially by the Spark team. Spark 3.4 and 3.5 are supported. Spark 4 is currently in preview and not yet supported by RumbleDB, but we are trying it out in order to support it in future releases.

Jars

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.22.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.22.0-standalone.jar with Java 11.
  • rumbledb-1.22.0-for-spark-3.4-scala-2-12.jar, rumbledb-1.22.0-for-spark-3.5-scala-2-12.jar, and rumbledb-1.22.0-for-spark-3.5-scala-2-13.jar are smaller in size, do not contain Spark, and can be run in a corresponding, existing Spark environment, either locally (in which case you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc.), with spark-submit rumbledb-....jar -q '1+1'

Improvements

  • Support for the W3C-standardized copy-modify-return expression as a more convenient way to transform JSON objects and arrays with the update syntax (insertion, deletion, replacement, renaming)
  • Support for the persistence of updates on objects and arrays read from Delta Lake (with the same update syntax)
  • Support for scripting: variable assignments, while loops, applying updates in the middle of the execution with visible side effects (under snapshot semantics), statements, block statements, continue, break, exit returning.
  • Many performance improvements
  • Many bug fixes
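
A hedged sketch of the copy-modify-return expression combined with the update syntax; the object contents are invented and the exact syntax may differ in details:

```jsoniq
(: transforms a copy of the object; the original bound value stays
   untouched :)
let $o := { "name": "Ada", "age": 36 }
return
  copy $c := $o
  modify (
    replace value of json $c.age with 37,
    insert json { "country": "UK" } into $c
  )
  return $c
```

Assuming the sketched syntax matches the implementation, this should return the object with age 37 and an added country field, while $o itself is unchanged.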

RumbleDB 1.21.0 "Hawthorn blossom" beta

16 May 13:09
53f4df0


NEW! The jar for Spark 3.5 was added and is available for download.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.21, as they are no longer supported officially by the Spark team. Spark 3.4 is newly supported.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.21.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.21.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.21.0-for-spark-3.X.jar (3.2, 3.3, 3.4) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment, either locally (in which case you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc.), with spark-submit rumbledb-1.21.0-for-spark-3.X.jar

Improvements

  • Automatically parallelizes range expressions with more than a million items with no need to call parallelize() any more.
  • Some simple map expressions on homogeneous input are now faster (native SQL behind the scenes).
  • General comparisons on equality are now considerably faster
  • reverse() is now more efficient and faster on homogeneous sequences
  • Fixed bug on equijoin involving homogeneous sequences
  • Add two functions jn:cosh and jn:sinh
  • Automatic optimization of general comparisons to value comparisons when it is detected that the sequences have at most one item (can be deactivated with --optimize-general-comparison-to-value-comparison no)
  • Better static type detection
  • It is now possible to force a sequential execution (without Spark) with --parallel-execution no. This also works with queries containing calls to parallelize() (which will be ineffective), json-doc(), and json-file() (which will simply stream-read from the disk). Other I/O functions (such as csv-file(), etc) will still involve Spark for reading, but immediately materialize for the rest of the execution.
  • It is now possible to deactivate Native Spark SQL execution (forcing a fallback to the use of UDFs by RumbleDB) with --native-execution no.
  • annotate expression (similar syntax to validate expression) allows directly annotating an item without checking for validity.
  • More static types are detected
  • Non-recursive functions are now automatically inlined for faster execution. This can be deactivated with --function-inlining no (reverting to behavior in previous versions)
  • TypeSwitch expressions now support DataFrame execution

Bugfixes

  • Fixed bug when reading longs from DataFrames
  • Fixed an issue with projection pushdowns in join queries
  • Fixed a few bugs with queries that navigate JSON in for clauses; they are compiled to native SQL whenever possible, but some chains were throwing errors (e.g., an array unboxing followed by object lookup)
  • Fixed a bug in which calling count() on a grouping variable did not return 1 when native SQL execution is activated
  • hexBinary and base64Binary values can now be used in order by clauses with parallel execution

RumbleDB 1.20.0 "Honeylocust"

07 Nov 12:57
38e07ca


Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.20, as they are no longer supported officially by the Spark team.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.20.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.20.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.20.0-for-spark-3.X.jar (3.2, 3.3) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment, either locally (in which case you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc.), with spark-submit rumbledb-1.20.0-for-spark-3.X.jar

New features:

  • Open and query YAML files (also with multiple documents) with yaml-doc()
  • Serialize the output of your queries to YAML with --output-format yaml
  • General comparisons (existential quantification) now scale to very big sequences and are automatically pushed down to Spark.
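
For instance, a YAML document can be opened and navigated like any JSON value (the file name and structure here are hypothetical):

```jsoniq
(: open a YAML document and navigate it with the usual object lookups :)
yaml-doc("deployment.yaml").metadata.name
```

Serializing the result back to YAML is then a matter of passing --output-format yaml on the command line.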

Bugfixes:

  • Fixed an issue preventing reading Decimal types from Parquet with some precisions and ranges
  • Fixed a few bugs in static typing
  • Fixed a bug that didn't throw an error when using the concatenation operator || on sequences with more than one item

RumbleDB 1.19.0 "Tipuana Tipu"

14 Jun 13:17
cd6684b


RumbleDB allows you to query data that does not fit in DataFrames with JSONiq.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.19.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.19.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.19.0-for-spark-3.X.jar (3.0, 3.1, 3.2, 3.3) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.19.0-for-spark-3.X.jar

Release notes:

  • Fixed the bug with spaces in paths
  • Various fixes and enhancements
  • New functions: repartition#2 to change the number of physical partitions, and binary-classification-metrics#3 and binary-classification-metrics#4 for preparing ROC and PR curves to evaluate the output of ML pipelines.
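
A hedged sketch of repartition#2; the argument order (sequence first, then number of partitions) and the input path are assumptions, not taken from the release notes:

```jsoniq
(: spread a parallelized sequence over 64 physical partitions
   before counting it :)
count(repartition(json-file("large.jsonl"), 64))
```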

RumbleDB 1.18.0 "Scarlet Ixora" beta

12 Apr 14:55
52a3424


Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

RumbleDB comes in 4 jars that you can pick from depending on your needs:

  • rumbledb-1.18.0-standalone.jar contains Spark already and can simply be run "out of the box" with java -jar rumbledb-1.18.0-standalone.jar with Java 8 or 11.
  • rumbledb-1.18.0-for-spark-3.X.jar (3.0, 3.1, 3.2) is smaller in size, does not contain Spark, and can be run in a corresponding, existing Spark environment either local (so you need to download and install Spark) or on a cluster (EMR with just a few clicks, etc) with spark-submit rumbledb-1.18.0-for-spark-3.X.jar

Release notes:

  • FLWOR expressions starting with a series of let are now better optimized and faster.
  • A warning with advice is issued in the command window if a group by is used in a FLWOR expression that starts with a let clause.
  • The shell will no longer exit when an error is thrown.
  • When a query cannot be executed in parallel, a more informative error message is output inviting the user to rewrite their query, instead of the raw Spark error.
  • When launching in shell or server mode, instructions are printed on the screen for next steps
  • Fixed a crash in the execution of some where clauses when a join was not successfully detected and execution fell back to linear execution
  • Support for context item declarations and passing an external context item value on the command line
  • By default, the date type no longer supports timezones (which are rarely used for this type, although supported by ISO 8601). This enables more optimizations (e.g., internal conversion to DataFrame DateType columns and export of datasets with dates to Parquet). Timezones on dates can be activated for those users who need them with a simple CLI argument (--dates-with-timezone yes).
  • Ctrl+C now elegantly exits the shell.

RumbleDB 1.17.0 "Cacao tree" beta

02 Feb 10:41
02b7b3b


Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

  • The CLI was extended with verbs (run, serve, repl) and single-dash shortcuts (-f for --output-format, etc). This is backward compatible.
  • Automatic internal conversion to DataFrames for FLWOR expressions executed in parallel when the statically inferred type is DataFrame-compatible.
  • Fixed a bug that prevented calling a variable $type or looking up a field called "type" without quotes.
  • Fixed a bug in projecting a sequence internally stored as a DataFrame to dynamically defined keys.
  • Fixed some bugs with post-grouping count optimizations on let variables
  • Support for Spark 2.4, which is no longer maintained by the Spark team, is now dropped, but available on request. RumbleDB 1.17 supports Spark 3.0, 3.1 and 3.2.
  • Plenty of smaller bug fixes
  • [Experimental] we also provide a jar that embeds Spark and does not require its installation (rumbledb-1.17.0-standalone.jar). It is for use on a local machine only (not a cluster) and works with java -jar rumbledb-1.17.0-standalone.jar run -q '1+1' rather than with spark-submit. Feedback is welcome! This is just experimental at this point and we will take it from there.

RumbleDB 1.16.2 "Shagbark Hickory" beta

09 Dec 10:14


Pre-release

Interim release.

  • Fix recursive view "input" issue.
  • Nicer message for out of memory errors and hint to use CLI parameters.
  • Reverted to Kryo 4 for Spark 3.2, which depends on Twitter Chill 0.10.0; Chill uses this version of Kryo in a way incompatible with Kryo 5.

Rumble 1.16.1 "Shagbark Hickory" beta

06 Dec 08:46
b703258


Pre-release

Interim release.

  • Fixed race condition issue with min() and max() called multiple times that led to possibly incorrect output.
  • The sum() and count() functions are now able to stream locally on very large (non-parallelized) sequences.
  • Range expressions now support 64 bit integers as well (before this, an overflow happened)
  • The arrow syntax works for dynamic function calls too, so ML pipelines in Rumble can also be invoked with a pipelining syntax: $training-set=>$my-transformer($params)=>my-estimator($params)
  • substring() was fixed to follow standard behavior even with exotic parameters (mostly returning an empty string in these cases)