Description
coveo org:search:dump consistently crashes when attempting to export a very large source (~4M items). The failure always occurs after ~600k–700k results and produces:
RangeError: Invalid array length
This is caused by the CLI aggregating all field names from every result into a single array (aggregatedFieldsWithDupes), which eventually exceeds JavaScript’s maximum array length. The issue is structural and unrelated to memory exhaustion or Node heap size.
Steps To Reproduce
Steps to reproduce the behavior:
- Run a source dump on a large source, for example:
  coveo org:search:dump --source "YourSourceName" --destination ./dump
- Allow the dump to progress past ~600k results.
- Observe the CLI terminating with:
  RangeError: Invalid array length
- Check the stack trace, which shows the failure in:
  extractFieldsFromAggregatedResults → dumpAggregatedResults → aggregateResults → fetchResults
Expected behavior
org:search:dump should:
- Successfully export large sources (millions of items).
- Stream results directly to disk without accumulating unbounded arrays.
- Track unique field names incrementally using a Set or similar structure (see the sketch below).
- Avoid exceeding JavaScript’s array-length limits.
The dump should complete regardless of source size or number of fields.
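A minimal sketch of the Set-based tracking described above, under illustrative assumptions (SearchResult, trackFields, and uniqueFieldNames are hypothetical names, not identifiers from the CLI source):

```ts
type SearchResult = Record<string, unknown>;

// Unique field names seen so far; grows with the number of distinct fields,
// not with the number of results.
const uniqueFieldNames = new Set<string>();

// Called once per page of results instead of spreading every field name of
// every result into one ever-growing array with duplicates.
function trackFields(pageOfResults: SearchResult[]): void {
  for (const result of pageOfResults) {
    for (const field of Object.keys(result)) {
      uniqueFieldNames.add(field);
    }
  }
}
```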
Screenshots
Stack trace excerpt illustrating the error:
RangeError: Invalid array length
at Array.push (<anonymous>)
at Dump.extractFieldsFromAggregatedResults (.../dump.js:162:40)
at Dump.dumpAggregatedResults (.../dump.js:157:14)
at Dump.aggregateResults (.../dump.js:149:18)
Desktop:
- OS: Windows 11
- OS version: 23H2
- Browser: N/A (CLI operation)
- CLI Version: Latest version as of 2025-12-09
- Local Node version: e.g., 18.x
- Local NPM version: e.g., 9.x
Where the problem occurs
The issue originates in dump.ts:
private extractFieldsFromAggregatedResults() {
  this.aggregatedFieldsWithDupes.push(
    ...this.aggregatedResults.flatMap(Object.keys)
  );
}

Call chain:
extractFieldsFromAggregatedResults
→ dumpAggregatedResults
→ aggregateResults
→ fetchResults
Because aggregatedFieldsWithDupes grows:
- for every result,
- across the entire dump,
- containing duplicates,
- and including potentially thousands of field names per item (dynamic fields, dictionary fields, system fields),
…the array eventually hits JavaScript’s array-length ceiling (2³²−1). At that point, the spread call push(...hugeArray) throws:
RangeError: Invalid array length
Increasing Node’s heap size does not affect this outcome.
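As a rough, hedged illustration of the arithmetic (the per-item figure is assumed, not measured): at roughly 6,000–7,000 field-name entries per item, the array reaches the 2³²−1 (≈4.29 billion) ceiling after about 600k–700k results, which matches the observed crash point. The ceiling itself is easy to demonstrate in isolation:

```ts
// Standalone demonstration (not CLI code): a plain array cannot grow past
// 2**32 - 1 elements, no matter how much heap is available.
const arr: unknown[] = [];
arr.length = 2 ** 32 - 1; // maximum legal length; the array stays sparse

try {
  arr.push('one more'); // would raise the length to 2**32
} catch (e) {
  console.error(e); // RangeError: Invalid array length
}
```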
Why this design fails at scale
- JavaScript array lengths are capped at 2³²−1 (the unsigned 32-bit range).
- aggregatedFieldsWithDupes grows unbounded as the dump progresses.
- Large sources with many fields multiply the size of this array quickly.
- The CLI attempts to aggregate all field names for all items before writing output, which is not feasible for multi-million-item dumps.
Thus, the failure is inherent to the current design rather than an environmental or memory constraint.
Impact
- org:search:dump cannot export large enterprise sources.
- The crash occurs reliably around 600k–700k items processed.
- Prevents use of the CLI for:
  - updating permanentid mappings (ID_MAPPING) across associated machine learning models,
  - audit,
  - analytics extraction.
- --fieldsToExclude helps only in limited cases; many sources contain high-cardinality dynamic fields where broad exclusion is not feasible.
Proposed fix
Switch from aggregate-then-write to a streaming write model
Rather than accumulating all field names and all results in memory, modify the algorithm to:
- Write each page of results directly to disk on retrieval.
- Track field names using a Set instead of a giant deduplicated-on-write array.
- Avoid ever using push(...largeArray).
- Keep memory usage constant regardless of source size (see the sketch after this list).
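A minimal sketch of that streaming model, under stated assumptions (fetchPage stands in for the CLI’s existing paginated search call, and newline-delimited JSON is used purely for illustration; none of these names come from dump.ts):

```ts
import {createWriteStream} from 'node:fs';

type SearchResult = Record<string, unknown>;

// Sketch: write each page to disk as soon as it is fetched and track field
// names in a Set, so memory use stays flat regardless of source size.
async function streamingDump(
  fetchPage: (page: number) => Promise<SearchResult[]>, // hypothetical pager
  outputPath: string
): Promise<Set<string>> {
  const out = createWriteStream(outputPath);
  const fieldNames = new Set<string>(); // bounded by distinct fields, not result count

  for (let page = 0; ; page++) {
    const results = await fetchPage(page);
    if (results.length === 0) {
      break; // no more pages
    }
    for (const result of results) {
      for (const field of Object.keys(result)) {
        fieldNames.add(field); // incremental dedup, never push(...largeArray)
      }
      out.write(JSON.stringify(result) + '\n'); // result goes straight to disk
    }
    // Nothing from this page is retained in memory past this point.
  }

  out.end();
  return fieldNames;
}
```

A production version would also honor write-stream backpressure (the boolean returned by out.write) and keep the CLI’s existing output format; the sketch only shows that no unbounded array is ever built.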
Benefits
- Eliminates array-length overflow.
- Enables dumping extremely large sources.
- Reduces memory footprint dramatically.
- Matches proven durable patterns used in log processing, ETL tools, and database dump pipelines.