UDF E2E tests failed #167

@windoze

Description

After merging #139, the UDF E2E tests failed because the PR switched from inspect.getsource() to pyspark.cloudpickle.

The root cause is twofold: cloudpickle does not pickle modules by value by default, and PyTest does not run test code in __main__ the way a normal Python script does. All UDFs defined in the test code therefore belong to a module other than __main__, which leads to a mysterious ModuleNotFound exception when the pickled UDFs are used in the remote PySpark job: the unpickler cannot find the module the pickled function belongs to, because the test module does not exist on the remote side.
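A minimal sketch of the failure mode (the module and UDF names here are hypothetical, not from the actual test code):

```python
from pyspark import cloudpickle  # the 1.6-era copy bundled with PySpark

def my_udf(x):  # hypothetical UDF defined inside a pytest test module
    return x + 1

# Under pytest this prints something like "test_udf_e2e", not "__main__",
# so cloudpickle serializes my_udf as a reference to that module name.
print(my_udf.__module__)

payload = cloudpickle.dumps(my_udf)
# Unpickling `payload` on a Spark executor tries to import "test_udf_e2e",
# which does not exist there, and raises ModuleNotFoundError.
```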

Since version 2.0, cloudpickle resolves this issue with a new register_pickle_by_value(module) function, but unfortunately that function does not exist in the cloudpickle bundled with PySpark, which has stayed at 1.6 for years.
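For reference, with a standalone cloudpickle >= 2.0 the fix would look roughly like this (the module name below is an assumption for illustration):

```python
import cloudpickle   # standalone cloudpickle >= 2.0, not the copy bundled with PySpark
import test_udf_e2e  # hypothetical module where the UDFs are defined

# Tell cloudpickle to serialize everything defined in this module by value,
# so the remote executors never need to import it.
cloudpickle.register_pickle_by_value(test_udf_e2e)

payload = cloudpickle.dumps(test_udf_e2e.my_udf)
```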

This issue does not occur in typical usage, e.g. in a Python REPL or a Jupyter notebook, because those environments run the current code in the __main__ module; it only affects PyTest.

One possible solution is to use the new cloudpickle, but that introduces a new dependency which must be installed on the remote Spark cluster manually.

The other solution is to let users package their UDFs into a separate Python file and submit that file along with the Spark job, but this is not compatible with the current test code and would require major restructuring.
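A sketch of what that approach could look like (file name and UDF name are assumptions): the separate UDF file is shipped to the executors so its module is importable on the remote side.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ship the UDF module to the executors, e.g. via addPyFile or
# `spark-submit --py-files udfs.py`, so "udfs" is importable remotely.
spark.sparkContext.addPyFile("udfs.py")

from udfs import my_udf  # hypothetical module and UDF
```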

The workaround is pretty straightforward: change the module name to __main__ at the very beginning of the E2E test script. It looks like a hack, but it does the job pretty well.
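A minimal sketch of what that hack can look like (the UDF is hypothetical): placed at the top of the test script, it makes every function defined afterwards report __module__ == "__main__", so the bundled cloudpickle 1.6 falls back to pickling them by value, just as it does in a REPL or notebook.

```python
# Very first statement in the E2E test script, before any UDF is defined.
__name__ = "__main__"

def my_udf(x):  # hypothetical UDF; my_udf.__module__ is now "__main__"
    return x + 1
```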
