Conversation

@Yicong-Huang Yicong-Huang commented Nov 5, 2025

What changes were proposed in this pull request?

Following up on #52680, this PR optimizes the non-Arrow path of toPandas() to eliminate intermediate DataFrame creation.

Key optimizations:

  1. Avoid intermediate DataFrame copy

    • pd.DataFrame.from_records(rows) → Direct column extraction via zip(*rows)
    • 2 DataFrame creations → 1 DataFrame creation
  2. Optimize column-by-column conversion (especially for wide tables)

    • Tuples → Lists for faster Series construction
    • Implicit dtype inference → Explicit dtype=object
    • pd.concat(axis="columns") + column rename → pd.concat(axis=1, keys=columns)
    • Result: 43-67% speedup for 50-100 columns (see the sketch after this list)
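
For illustration, here is a minimal sketch of the new column-wise construction as described above; names such as `rows`, `columns`, and `columns_data` are placeholders for this example, not necessarily the identifiers used in the patch:

```python
import pandas as pd

# rows: records collected from the JVM; columns: schema field names.
rows = [(1, "a"), (2, "b"), (3, "c")]
columns = ["id", "value"]

# NEW: transpose rows into per-column lists and build each Series directly,
# skipping the temporary DataFrame that from_records() used to create.
columns_data = [list(col) for col in zip(*rows)] if rows else [[] for _ in columns]
series_list = [pd.Series(col_data, dtype=object) for col_data in columns_data]

# Single DataFrame creation; keys= assigns column names without a rename pass.
pdf = pd.concat(series_list, axis=1, keys=columns) if series_list else pd.DataFrame(columns=columns)
```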

Why are the changes needed?

Problem: the current flow creates the DataFrame twice:

  • rows → pd.DataFrame.from_records() → temporary DataFrame → pd.concat() → final DataFrame

The intermediate DataFrame is immediately discarded, wasting memory. This is especially inefficient for wide tables where column-by-column overhead is significant.
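
For comparison, a sketch of the old flow under the same hypothetical names; the DataFrame produced by `from_records` is only iterated for its Series and then discarded:

```python
import pandas as pd

rows = [(1, "a"), (2, "b")]
columns = ["id", "value"]

# DataFrame creation #1: a temporary frame used only to iterate columns.
tmp = pd.DataFrame.from_records(rows, columns=columns)

# Per-column type conversion would happen here (converters elided).
converted = [pser for _, pser in tmp.items()]

# DataFrame creation #2: the frame actually returned; `tmp` is thrown away.
pdf = pd.concat(converted, axis="columns")
pdf.columns = columns
```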

Does this PR introduce any user-facing change?

No. This is a pure performance optimization with no API or behavior changes.

How was this patch tested?

  • Existing unit tests.
  • Benchmark

Benchmark setup:

  • Resources: driver memory 4GB, executor memory 4GB
  • Configuration: spark.sql.execution.arrow.pyspark.enabled=false (testing non-Arrow path)
  • Iterations: 10 per test case for statistical reliability (a sketch of the harness follows this list)
  • Test cases:
    • Simple (numeric columns)
    • Mixed (int, string, double, boolean)
    • Timestamp (date and timestamp types)
    • Nested (struct and array types)
    • Wide (5, 10, 50, 100 column counts)
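
A rough sketch of such a benchmark harness, for illustration only; the helper below and the wide-table construction are assumptions, not the actual benchmark script used for the numbers that follow:

```python
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "4g")
    .config("spark.sql.execution.arrow.pyspark.enabled", "false")  # force the non-Arrow path
    .getOrCreate()
)

def avg_to_pandas_seconds(df, iterations=10):
    """Run df.toPandas() repeatedly and return the mean wall-clock time."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        df.toPandas()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example: the "wide" case with 100K rows and 50 numeric columns.
wide_df = spark.range(100_000).selectExpr(*[f"id * {i} AS c{i}" for i in range(50)])
print(f"wide (50 cols): {avg_to_pandas_seconds(wide_df):.3f}s")
```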

Performance Results

General Benchmark (10 iterations):

| Test Case | Rows | OLD → NEW | Speedup |
|-----------|------|-----------------|---------------|
| simple | 1M | 1.376s → 1.383s | ≈ Tied |
| mixed | 1M | 2.396s → 2.553s | 6% slower |
| timestamp | 500K | 4.323s → 4.392s | ≈ Tied |
| nested | 100K | 0.558s → 0.580s | 4% slower |
| wide (50) | 100K | 1.458s → 1.141s | 28% faster 🚀 |

Column Width Benchmark (100K rows, 10 iterations):

| Columns | OLD → NEW | Speedup |
|---------|-----------------|----------------|
| 5 | 0.188s → 0.179s | 5% faster |
| 10 | 0.262s → 0.270s | ≈ Tied |
| 50 | 1.430s → 0.998s | 43% faster 🚀 |
| 100 | 3.320s → 1.988s | 67% faster 🚀 |

Was this patch authored or co-authored using generative AI tooling?

Yes. Co-Generated-by Cursor

@HyukjinKwon HyukjinKwon changed the title [WIP][Spark-54182] avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [WIP][SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas Nov 10, 2025
@Yicong-Huang Yicong-Huang changed the title [SPARK-54182][SQL][PYTHON] Avoid intermedia dataframe creation in non-arrow codepath of df.toPandas [SPARK-54182][SQL][PYTHON] Optimize non-arrow conversion of df.toPandas Nov 10, 2025
Diff context for the comment below:

```diff
             timestamp_utc_localized=False,
-        )(pser)
-        for (_, pser), field in zip(pdf.items(), self.schema.fields)
+        )(pd.Series(col_data, dtype=object))
```
Review comment (Contributor):
why is dtype=object necessary?

Diff context for the comment below:

```diff
-        )(pser)
-        for (_, pser), field in zip(pdf.items(), self.schema.fields)
+        )(pd.Series(col_data, dtype=object))
+        for col_data, field in zip(columns_data, self.schema.fields)
```
Review comment (Contributor):
can we avoid creating columns_data: list[list] ?
