Conversation

@Yicong-Huang Yicong-Huang commented Nov 5, 2025

What changes were proposed in this pull request?

Following up on #52680, this PR optimizes the non-Arrow path of toPandas() to eliminate intermediate DataFrame creation.

Key optimizations:

  1. Avoid intermediate DataFrame copy

    • pd.DataFrame.from_records(rows) → Direct column extraction via zip(*rows)
    • 2 DataFrame creations → 1 DataFrame creation
  2. Optimize column-by-column conversion (especially for wide tables)

    • Tuples → Lists for faster Series construction
    • Implicit dtype inference → Explicit dtype=object
    • pd.concat(axis="columns") + column rename → pd.concat(axis=1, keys=columns)
    • Result: 43-67% speedup for 50-100 columns (see the sketch after this list)
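
For illustration, here is a minimal sketch of the new column-wise construction as described above; names such as `rows`, `columns`, and `columns_data` are placeholders for this example, not necessarily the identifiers used in the patch:

```python
import pandas as pd

# rows: records collected from the JVM; columns: schema field names.
rows = [(1, "a"), (2, "b"), (3, "c")]
columns = ["id", "value"]

# NEW: transpose rows into per-column lists and build each Series directly,
# skipping the temporary DataFrame that from_records() used to create.
columns_data = [list(col) for col in zip(*rows)] if rows else [[] for _ in columns]
series_list = [pd.Series(col_data, dtype=object) for col_data in columns_data]

# Single DataFrame creation; keys= assigns column names without a rename pass.
pdf = pd.concat(series_list, axis=1, keys=columns) if series_list else pd.DataFrame(columns=columns)
```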

Why are the changes needed?

Problem: the current flow creates the DataFrame twice:

  • rows → pd.DataFrame.from_records() → temporary DataFrame → pd.concat() → final DataFrame

The intermediate DataFrame is immediately discarded, wasting memory. This is especially inefficient for wide tables where column-by-column overhead is significant.
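
For comparison, a sketch of the old flow under the same hypothetical names; the DataFrame produced by `from_records` is only iterated for its Series and then discarded:

```python
import pandas as pd

rows = [(1, "a"), (2, "b")]
columns = ["id", "value"]

# DataFrame creation #1: a temporary frame used only to iterate columns.
tmp = pd.DataFrame.from_records(rows, columns=columns)

# Per-column type conversion would happen here (converters elided).
converted = [pser for _, pser in tmp.items()]

# DataFrame creation #2: the frame actually returned; `tmp` is thrown away.
pdf = pd.concat(converted, axis="columns")
pdf.columns = columns
```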

Does this PR introduce any user-facing change?

No. This is a pure performance optimization with no API or behavior changes.

How was this patch tested?

  • Existing unit tests.
  • Benchmark

Benchmark setup:

  • Resources: driver memory 4GB, executor memory 4GB
  • Configuration: spark.sql.execution.arrow.pyspark.enabled=false (testing non-Arrow path)
  • Iterations: 10 per test case for statistical reliability (a sketch of the harness follows this list)
  • Test cases:
    • Simple (numeric columns)
    • Mixed (int, string, double, boolean)
    • Timestamp (date and timestamp types)
    • Nested (struct and array types)
    • Wide (5, 10, 50, 100 column counts)
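
A rough sketch of such a benchmark harness, for illustration only; the helper below and the wide-table construction are assumptions, not the actual benchmark script used for the numbers that follow:

```python
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")  # force the non-Arrow path
    .getOrCreate()
)

def avg_to_pandas_seconds(df, iterations=10):
    """Run df.toPandas() repeatedly and return the mean wall-clock time."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        df.toPandas()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example: the "wide" case with 100K rows and 50 numeric columns.
wide_df = spark.range(100_000).selectExpr(*[f"id * {i} AS c{i}" for i in range(50)])
print(f"wide (50 cols): {avg_to_pandas_seconds(wide_df):.3f}s")
```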

Performance Results

General Benchmark (10 iterations):

| Test Case | Rows | OLD → NEW | Speedup |
|-----------|------|-----------------|---------------|
| simple | 1M | 1.376s → 1.383s | ≈ Tied |
| mixed | 1M | 2.396s → 2.553s | 6% slower |
| timestamp | 500K | 4.323s → 4.392s | ≈ Tied |
| nested | 100K | 0.558s → 0.580s | 4% slower |
| wide (50) | 100K | 1.458s → 1.141s | 28% faster 🚀 |

Column Width Benchmark (100K rows, 10 iterations):

| Columns | OLD → NEW | Speedup |
|---------|-----------------|----------------|
| 5 | 0.188s → 0.179s | 5% faster |
| 10 | 0.262s → 0.270s | ≈ Tied |
| 50 | 1.430s → 0.998s | 43% faster 🚀 |
| 100 | 3.320s → 1.988s | 67% faster 🚀 |

Was this patch authored or co-authored using generative AI tooling?

Yes. Co-Generated-by Cursor

@HyukjinKwon HyukjinKwon changed the title [WIP][Spark-54182] avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Optimize non-arrow conversion of df.toPandas Nov 10, 2025
Diff context for the comment below:

```diff
             timestamp_utc_localized=False,
-        )(pser)
-        for (_, pser), field in zip(pdf.items(), self.schema.fields)
+        )(pd.Series(col_data, dtype=object))
```
Review comment (Contributor):
why is dtype=object necessary?

Diff context for the comment below:

```diff
-        )(pser)
-        for (_, pser), field in zip(pdf.items(), self.schema.fields)
+        )(pd.Series(col_data, dtype=object))
+        for col_data, field in zip(columns_data, self.schema.fields)
```
Review comment (Contributor):
can we avoid creating columns_data: list[list] ?
