
Bug Report: org:search:dump fails on large sources with RangeError: Invalid array length #1536

@an-d-uu

Description


coveo org:search:dump consistently crashes when attempting to export a very large source (~4M items). The failure always occurs after ~600k–700k results and produces:

RangeError: Invalid array length

This is caused by the CLI aggregating all field names from every result into a single array (aggregatedFieldsWithDupes), which eventually exceeds JavaScript’s maximum array length. The issue is structural and unrelated to memory exhaustion or Node heap size.


Steps To Reproduce

Steps to reproduce the behavior:

  1. Run a source dump on a large source, for example:
    coveo org:search:dump --source "YourSourceName" --destination ./dump
  2. Allow the dump to progress past ~600k results.
  3. Observe the CLI terminating with:
    RangeError: Invalid array length
    
  4. Check the stack trace, which shows the failure in:
    • extractFieldsFromAggregatedResults
    • dumpAggregatedResults
    • aggregateResults
    • fetchResults

Expected behavior

org:search:dump should:

  • Successfully export large sources (millions of items).
  • Stream results directly to disk without accumulating unbounded arrays.
  • Track unique field names incrementally using a Set or similar structure.
  • Avoid exceeding JavaScript’s array-length limits.

The dump should complete regardless of source size or number of fields.


Screenshots

Stack trace excerpt illustrating the error:

RangeError: Invalid array length
    at Array.push (<anonymous>)
    at Dump.extractFieldsFromAggregatedResults (.../dump.js:162:40)
    at Dump.dumpAggregatedResults (.../dump.js:157:14)
    at Dump.aggregateResults (.../dump.js:149:18)

Full output attached: error.log


Desktop:

  • OS: Windows 11 (23H2)
  • Browser: N/A (CLI operation)
  • CLI Version: Latest version as of 2025-12-09
  • Local Node version: e.g., 18.x
  • Local NPM version: e.g., 9.x

Where the problem occurs

The issue originates in dump.ts:

private extractFieldsFromAggregatedResults() {
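  // Collects every field name of every aggregated result, duplicates included,
  // into one ever-growing array; once that array hits the length limit, the
  // push below throws RangeError: Invalid array length.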
  this.aggregatedFieldsWithDupes.push(
    ...this.aggregatedResults.flatMap(Object.keys)
  );
}

Call chain:

extractFieldsFromAggregatedResults
  → dumpAggregatedResults
    → aggregateResults
      → fetchResults

Because aggregatedFieldsWithDupes grows:

  • with every result processed,
  • across the entire dump,
  • with duplicates retained,
  • and with potentially thousands of fields per item (dynamic fields, dictionary fields, system fields),

…the array eventually crosses JavaScript’s array-length ceiling (~2³²−1). The spread operator push(...hugeArray) triggers:

RangeError: Invalid array length

Increasing Node’s heap size does not affect this outcome.
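
The length ceiling itself is easy to reproduce in isolation (a minimal Node.js/TypeScript sketch for illustration, independent of the CLI code):

const arr: string[] = [];
arr.length = 2 ** 32 - 1; // maximum valid array length; the array stays sparse, so nothing is allocated
try {
  arr.push('one-more-field'); // the new length would be 2^32
} catch (e) {
  console.log((e as Error).message); // "Invalid array length" on V8 (Node.js)
}

In the CLI the same limit is reached gradually, as field names accumulate across the whole dump.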


Why this design fails at scale

  • JavaScript array lengths are capped at 2³²−1 (a 32-bit limit).
  • aggregatedFieldsWithDupes grows unbounded as the dump progresses.
  • Large sources with many fields multiply the size of this array quickly.
  • The CLI attempts to aggregate all field names for all items before writing output, which is not feasible for multi-million-item dumps.

Thus, the failure is inherent to the current design rather than an environmental or memory constraint.


Impact

  • org:search:dump cannot export large enterprise sources.
  • The crash occurs reliably around 600k–700k items processed.
  • Prevents use of the CLI for:
    • updating permanentid mappings (ID_MAPPING) across associated machine learning models,
    • audits,
    • analytics extraction.
  • --fieldsToExclude helps only in limited cases; many sources contain high-cardinality dynamic fields where broad exclusion is not feasible.

Proposed fix

Switch from aggregate-then-write to a streaming write model

Rather than accumulating all field names and all results in memory, modify the algorithm as follows (a sketch appears after the list):

  1. Write each page of results directly to disk on retrieval.
  2. Track field names using a Set instead of one giant array that is only deduplicated when output is written.
  3. Avoid ever using push(...largeArray).
  4. Keep memory usage constant regardless of source size.
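
A rough sketch of this shape, for illustration only (fetchPage, the ResultPage type, the JSON Lines output format, and the file names below are placeholders, not the CLI's actual API or output format):

import {createWriteStream} from 'node:fs';
import {writeFile} from 'node:fs/promises';

// Hypothetical page shape; the real CLI works with Coveo search results.
interface ResultPage {
  results: Record<string, unknown>[];
}

async function dumpSource(
  fetchPage: (page: number) => Promise<ResultPage>,
  destination: string
): Promise<void> {
  const out = createWriteStream(`${destination}/results.jsonl`);
  const uniqueFields = new Set<string>(); // bounded by the number of distinct fields, not by result count
  let page = 0;

  for (;;) {
    const {results} = await fetchPage(page++);
    if (results.length === 0) break;

    for (const result of results) {
      // Track field names incrementally; no unbounded array, no push(...spread).
      for (const field of Object.keys(result)) {
        uniqueFields.add(field);
      }
      // Write each result to disk as soon as it is retrieved.
      // (Backpressure handling omitted for brevity.)
      out.write(JSON.stringify(result) + '\n');
    }
  }

  out.end();
  // Persist the deduplicated field list once, after the dump completes.
  await writeFile(
    `${destination}/fields.json`,
    JSON.stringify([...uniqueFields].sort(), null, 2)
  );
}

Memory usage stays proportional to one page of results plus the set of distinct field names, so source size no longer matters.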

Benefits

  • Eliminates array-length overflow.
  • Enables dumping extremely large sources.
  • Reduces memory footprint dramatically.
  • Matches proven durable patterns used in log processing, ETL tools, and database dump pipelines.
