29 changes: 17 additions & 12 deletions python/pyspark/sql/pandas/conversion.py
@@ -18,6 +18,7 @@
from typing import (
Any,
Callable,
Iterator,
List,
Optional,
Union,
@@ -208,18 +209,20 @@ def toPandas(self) -> "PandasDataFrameLike":

# Below is toPandas without Arrow optimization.
rows = self.collect()
if len(rows) > 0:
pdf = pd.DataFrame.from_records(
rows, index=range(len(rows)), columns=self.columns # type: ignore[arg-type]
)
else:
pdf = pd.DataFrame(columns=self.columns)

if len(pdf.columns) > 0:
if len(self.columns) > 0:
timezone = sessionLocalTimeZone
struct_in_pandas = pandasStructHandlingMode

return pd.concat(
# Extract columns from rows and apply converters
if len(rows) > 0:
# Use iterator to avoid materializing intermediate data structure
columns_data: Iterator[Any] = iter(zip(*rows))
else:
columns_data = iter([] for _ in self.schema.fields)

# Build DataFrame from columns
pdf = pd.concat(
[
_create_converter_to_pandas(
field.dataType,
@@ -230,13 +233,15 @@ def toPandas(self) -> "PandasDataFrameLike":
),
error_on_duplicated_field_names=False,
timestamp_utc_localized=False,
)(pser)
for (_, pser), field in zip(pdf.items(), self.schema.fields)
)(pd.Series(col_data, dtype=object))
Contributor:
why is dtype=object necessary?

Contributor Author:

Here we are building a Series to pass to _create_converter_to_pandas, which converts it to the type declared in field.dataType, so strictly speaking the dtype argument is optional. But if we do not supply object, pandas will try to infer the type while creating the Series, which can be quite slow.

Contributor Author:
I also tried passing field.dataType explicitly, but it would need to be converted to a pandas dtype first, which is exactly what _create_converter_to_pandas is for. So I suggest we keep using object to disable type inference.
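
As an aside, here is a minimal standalone sketch of the trade-off described above, using plain pandas and made-up data. It does not call the real converter; the final astype is only a stand-in for what _create_converter_to_pandas does:

```python
import pandas as pd

col_data = [1, 2, None, 4]

# Without an explicit dtype, pandas inspects every element to infer one
# (here it picks float64 so None can become NaN); that inference is the
# slow part for wide or long columns.
inferred = pd.Series(col_data)
print(inferred.dtype)  # float64

# With dtype=object the values are stored as-is and no inference happens.
raw = pd.Series(col_data, dtype=object)
print(raw.dtype)  # object

# Stand-in for the converter step: the declared type is applied explicitly
# afterwards rather than guessed at construction time.
converted = raw.astype("Int64")
print(converted.dtype)  # Int64
```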

for col_data, field in zip(columns_data, self.schema.fields)
Contributor:

Can we avoid creating columns_data as a list[list]?

Contributor Author:

Good point, although I think it is only a list holding references to the column data, so the memory difference would not be large. I changed it to an iterator anyway.
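
For context, a small standalone sketch (hypothetical rows, not the Spark code path) of the difference between materializing the transposed columns and iterating over them lazily:

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

# Materialized: every column tuple exists in memory at once
# (though each tuple only holds references to the existing row values).
columns_as_list = list(zip(*rows))   # [(1, 2, 3), ('a', 'b', 'c')]

# Lazy: zip(*rows) is already an iterator, so each column tuple is only built
# when the consumer (the list comprehension in the patch) asks for it.
# The extra iter() mirrors the patch, where it keeps the annotation consistent
# with the empty-rows branch.
columns_as_iter = iter(zip(*rows))
for col in columns_as_iter:
    print(col)
```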

],
axis="columns",
axis=1,
keys=self.columns,
)
else:
return pdf
else:
return pd.DataFrame(columns=[], index=range(len(rows)))

def toArrow(self) -> "pa.Table":
from pyspark.sql.dataframe import DataFrame
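
Putting the pieces together, a condensed standalone sketch of the new column-wise construction path shown in the diff above. This uses plain pandas only; `rows` and `columns` are made up for illustration, and the per-column converter step (_create_converter_to_pandas) is omitted, so the Series built with dtype=object stand in for the converted columns:

```python
from typing import Any, Iterator

import pandas as pd

rows = [(1, "a"), (2, "b"), (3, "c")]
columns = ["id", "label"]

# Transpose row tuples into per-column data without materializing a list of lists.
columns_data: Iterator[Any] = iter(zip(*rows)) if rows else iter([] for _ in columns)

# Build one Series per column (dtype=object skips inference), then stitch the
# columns together; keys= supplies the column labels for the resulting frame.
pdf = pd.concat(
    [pd.Series(col, dtype=object) for col in columns_data],
    axis=1,
    keys=columns,
)
print(pdf)
```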