UDF E2E tests failed #167

@windoze

Description

After merging #139, the UDF E2E tests failed because the PR switched from inspect.getsource() to pyspark.cloudpickle.

The root cause is twofold: cloudpickle does not pickle modules by value by default, and PyTest does not run test code in __main__ the way a normal Python script does. All UDFs defined in the test code therefore belong to a module other than __main__, which leads to a mysterious ModuleNotFound exception when the pickled UDFs are used in the remote PySpark job: the unpickler cannot find the module the pickled function belongs to, because the test module does not exist on the remote side.
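A minimal sketch of the failure mode (the module and UDF names here are hypothetical, not from the actual test code):

```python
from pyspark import cloudpickle  # the 1.6-era copy bundled with PySpark

def my_udf(x):  # hypothetical UDF defined inside a pytest test module
    return x + 1

# Under pytest this prints something like "test_udf_e2e", not "__main__",
# so cloudpickle serializes my_udf as a reference to that module name.
print(my_udf.__module__)

payload = cloudpickle.dumps(my_udf)
# Unpickling `payload` on a Spark executor tries to import "test_udf_e2e",
# which does not exist there, and raises ModuleNotFoundError.
```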

Since version 2.0, cloudpickle resolves this issue with a new register_pickle_by_value(module) function, but unfortunately that function does not exist in the cloudpickle bundled with PySpark, which has stayed at 1.6 for years.
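For reference, with a standalone cloudpickle >= 2.0 the fix would look roughly like this (the module name below is an assumption for illustration):

```python
import cloudpickle   # standalone cloudpickle >= 2.0, not the copy bundled with PySpark
import test_udf_e2e  # hypothetical module where the UDFs are defined

# Tell cloudpickle to serialize everything defined in this module by value,
# so the remote executors never need to import it.
cloudpickle.register_pickle_by_value(test_udf_e2e)

payload = cloudpickle.dumps(test_udf_e2e.my_udf)
```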

This issue does not occur in typical usage, e.g. in a Python REPL or a Jupyter notebook, because those environments run the current code in the __main__ module; it only affects PyTest.

One possible solution is to use the new cloudpickle, but that introduces a new dependency which must be installed on the remote Spark cluster manually.

The other solution is to let users package their UDFs into a separate Python file and submit that file along with the Spark job, but this is not compatible with the current test code and would require major restructuring.
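A sketch of what that approach could look like (file name and UDF name are assumptions): the separate UDF file is shipped to the executors so its module is importable on the remote side.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ship the UDF module to the executors, e.g. via addPyFile or
# `spark-submit --py-files udfs.py`, so "udfs" is importable remotely.
spark.sparkContext.addPyFile("udfs.py")

from udfs import my_udf  # hypothetical module and UDF
```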

The workaround is pretty straightforward: change the module name to __main__ at the very beginning of the E2E test script. It looks like a hack, but it does the job pretty well.
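A minimal sketch of what that hack can look like (the UDF is hypothetical): placed at the top of the test script, it makes every function defined afterwards report __module__ == "__main__", so the bundled cloudpickle 1.6 falls back to pickling them by value, just as it does in a REPL or notebook.

```python
# Very first statement in the E2E test script, before any UDF is defined.
__name__ = "__main__"

def my_udf(x):  # hypothetical UDF; my_udf.__module__ is now "__main__"
    return x + 1
```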
