Skip to content

Conversation

@benwilson512
Copy link

IN improvements for Ecto.CH

Hey folks, thanks for the library! We ran into a problem using the in operator over at CargoSense and we're hoping you're open to a change in the SQL those queries produce to make more efficient use of Clickhouse params.

Problem

Currently when you have an ecto query like:

ids = [1,2,3]
from q in queryable, where: q.id in ^ids

This results in clickhouse SQL:

WHERE q.id in ($0:Int64,$1:Int64,$2:Int64)

This becomes an issue when ids or any other array parameter becomes very large, which is common when using Ecto CH to perform bulk operations in Clickhouse.

Solution

Now in this PR the same Ecto query produces this Clickhouse SQL:

WHERE q.id in {$0:Array(Int64)}

We have just one parameter no matter the size of the array.


defp expr({:in, _, [_, {:^, _, [_ix, 0]}]}, _sources, _params, _query), do: "0"

defp expr({:in, _, [left, {:^, _, [ix, len]}]}, sources, params, query) when len > 0 do
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notably this functionality only happens with ^ values, literals maintain their current functionality, which helps this remain compatible with complex uses of IN such as with subqueries.

def dumpers(:uuid, Ecto.UUID), do: [&__MODULE__.hex_uuid/1]
def dumpers(:uuid, type), do: [type, &__MODULE__.hex_uuid/1]
def dumpers(:binary_id, type), do: [type, &__MODULE__.hex_uuid/1]
def dumpers({:in, sub}, {:in, sub}), do: [{:array, sub}]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line here is what tells Ecto itself that in ^val should result in a single parameter for val

@ruslandoga
Copy link
Collaborator

ruslandoga commented Nov 30, 2025

👋 @benwilson512

Thank you!

I'd like to convince myself that it's indeed equivalent to the previous approach (lists with nulls, different types, etc.) before merging so I'm taking some time :)


Btw, just in case, https://clickhouse.com/docs/sql-reference/operators/in suggests

Don't list too many values explicitly (i.e. millions). If a data set is large, put it in a temporary table (for example, see the section External data for query processing), then use a subquery.

But I guess this change is not about millions of values but rather maybe hundreds to thousands


The CI is failing for unrelated issues, "fixed" in #252.

@ukutaht
Copy link

ukutaht commented Dec 3, 2025

Nice! We also started recently using IN queries with fairly large payloads for the new consolidated views feature. We currently limit it to 14k sites so we have at maximum 14k IDs in the where site_id IN (list_of_ids) query.

We also tested creating dictionaries and using these for lookups so the list of IDs does not need to be part of the query payload. It was more efficient as expected but for now we didn't go for that solution to reduce maintenance burden.

tagging @aerosol. I don't think we have problems with the limit at the moment but this is interesting in case we need to bump the limit in the future.

@aerosol
Copy link
Member

aerosol commented Dec 3, 2025

Just on a quick glance, it looks like this is equivalent to the workaround we're applying to avoid hitting ClickHouse limits due to too many bindings: https://github.com/plausible/analytics/blob/fa09b73ff1acff2248a664d652f818c999c7dcca/lib/plausible/stats/sql/where_builder.ex#L54

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants