-
Notifications
You must be signed in to change notification settings - Fork 459
Description
I am executing a performance benchmarking for Apache AGE by using goodreads dataset. I've found a significant performance gap between a direct SQL query and its AGE/Cypher equivalent, specifically with aggregation, grouping, and ordering on a large dataset.
Here is my graph design:
Fast Direct PostgreSQL Query (3 seconds):
select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
Slow AGE/Cypher Query (50+ seconds):
SELECT * FROM cypher('goodreads_graph', $$ MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
I expect AGE/Cypher to be much closer in performance to direct SQL. Currently, the Cypher query is over 15x slower, which is a major issue.
I believe my sample Cypher query is similar to, and aligns with, the sorting on aggregate functions sample in the official AGE documentation. Is this a bug that will be addressed in an upcoming release?
I'm looking for guidance on how to optimize this Cypher query to achieve performance more comparable to the direct PostgreSQL query. Any recommendations on AGE-specific best practices, indexing for aggregations, or relevant configuration tuning would be greatly appreciated.
For reference, when the ORDER BY clause is removed from the AGE/Cypher query, its execution time significantly improves.