Each vector-valued column in a Spark DataFrame is replaced by ordinary columns to facilitate use in PySpark.
Columns of type VectorUDT are identified, and each vector element is placed in its own column.
For example, LogisticRegressionModel creates a column probability, which will be replaced
by columns probability[0], probability[1], ... Note that these brackets are not index
expressions; [0] is a literal part of the column's name.
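For illustration, here is roughly what the flattened schema and column access look like (a
hedged sketch: predictions is assumed to be the output of a three-class
LogisticRegressionModel, and vectorFlattenerUDT is the function described below):

    flat = vectorFlattenerUDT(predictions)
    flat.printSchema()
    # ...
    #  |-- probability[0]: double (nullable = true)
    #  |-- probability[1]: double (nullable = true)
    #  |-- probability[2]: double (nullable = true)

    # since the brackets are literal characters in the name, dictionary-style
    # lookup is the safest way to select such a column
    flat.select(flat["probability[1]"]).show()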
VectorAssembler, OneHotEncoder, and Model objects create columns of the user-defined type
VectorUDT, whose values are vectors, either DenseVector or SparseVector, with elements that
may be floats or integers.
The elements of these vectors are not directly indexable in PySpark. Moreover, when a
pandas_udf built on pyarrow encounters a VectorUDT, the conversion is not optimized: the
result is a pandas Series of dtype object, whose values must be converted row by row
using .tolist() or .values. Working with these in pandas is tedious and error-prone.
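The same object-dtype behavior is easy to see with toPandas (an illustrative snippet;
predictions is again assumed to carry a probability vector column):

    pdf = predictions.select("probability").toPandas()
    pdf["probability"].dtype    # dtype('O'): each cell is a DenseVector/SparseVector
    # unpacking must be done row by row, e.g.
    pdf["p0"] = pdf["probability"].apply(lambda v: float(v[0]))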
The vectorFlattenerUDT function provides the core functionality. Given a DataFrame, it
detects columns whose dataType is VectorUDT, inspects one row to determine each vector's
length and value type, generates a udf that places each vector element in its own column,
and uses a Spark SQL select to return a DataFrame free of VectorUDTs.
If the metadata recording the length and element type of a VectorUDT column were consistently and reliably set, there would be no need to inspect a row to infer them.
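A minimal sketch of that logic (a re-implementation under assumed names, simplified to
double-valued elements and non-null vectors; it is not the actual source):

    from pyspark.ml.linalg import VectorUDT
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    def vector_flattener(df):
        # detect the VectorUDT columns
        vector_cols = [f.name for f in df.schema.fields
                       if isinstance(f.dataType, VectorUDT)]
        if not vector_cols:
            return df
        # inspect a single row to infer each vector's length, since the
        # metadata cannot be relied on
        row = df.select(*vector_cols).first()
        lengths = {c: len(row[c]) for c in vector_cols}
        # a udf extracting one element; both DenseVector and SparseVector
        # support integer indexing
        element = F.udf(lambda v, i: float(v[i]), DoubleType())
        cols = []
        for field in df.schema.fields:
            if field.name in lengths:
                cols += [element(F.col(field.name), F.lit(i)).alias(f"{field.name}[{i}]")
                         for i in range(lengths[field.name])]
            else:
                cols.append(F.col(field.name))
        return df.select(*cols)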
A Transformer class minimally wraps the above, as does an Estimator/Model pair. The fit
method finds the names, lengths, and data types of the vector columns and makes them
available as parameters of the Model object it emits.
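A sketch of what those wrappers could look like (assumed names; plain attributes stand in
for Spark ML Params, and flatten_with_lengths is a hypothetical helper that flattens using
the recorded lengths instead of re-inspecting):

    from pyspark.ml import Estimator, Model, Transformer
    from pyspark.ml.linalg import VectorUDT

    class VectorFlattener(Transformer):
        # stateless wrapper: re-inspects the data on every call
        def _transform(self, df):
            return vector_flattener(df)   # the function sketched above

    class VectorFlattenerEstimator(Estimator):
        # fit() inspects the data once; the Model it emits carries the
        # vector names and lengths and can be reused without re-inspection
        def _fit(self, df):
            vector_cols = [f.name for f in df.schema.fields
                           if isinstance(f.dataType, VectorUDT)]
            row = df.select(*vector_cols).first()
            return VectorFlattenerModel({c: len(row[c]) for c in vector_cols})

    class VectorFlattenerModel(Model):
        def __init__(self, lengths):
            super().__init__()
            self.lengths = lengths        # {column name: vector length}
        def _transform(self, df):
            return flatten_with_lengths(df, self.lengths)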
A VectorReAssembler detects columns whose names end in [###] (where ### is a sequence of
digits) and uses VectorAssembler to re-assemble them into a vector. A DataFrame that is
flattened and then re-assembled will usually be identical to the original. Note, however,
that any SparseVector-valued columns will be re-assembled as DenseVector.
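A sketch of the re-assembly step (the function name is an assumption; the library exposes
this as the VectorReAssembler described above):

    import re
    from pyspark.ml.feature import VectorAssembler

    def reassemble_vectors(df):
        pattern = re.compile(r"^(.*)\[(\d+)\]$")
        # group flattened columns by their base name
        groups = {}
        for name in df.columns:
            m = pattern.match(name)
            if m:
                groups.setdefault(m.group(1), []).append((int(m.group(2)), name))
        # assemble each group back into a single vector column, in index order
        for base, indexed in groups.items():
            ordered = [name for _, name in sorted(indexed)]
            assembler = VectorAssembler(inputCols=ordered, outputCol=base)
            df = assembler.transform(df).drop(*ordered)
        return df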
Ideally, call vectorFlattenerUDT before the pandas_udf, apply the pandas transformations
(dropping unneeded columns along the way), and re-assemble; pandas_udf could even be
decorated to do this automatically.
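An illustrative round trip under the names sketched above (predictions and the specific
column names are assumptions):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    flat = vectorFlattenerUDT(predictions)    # no VectorUDT columns remain

    # flattened columns arrive as plain float64 Series, so the optimized
    # Arrow path applies
    @pandas_udf(DoubleType())
    def margin(p0: pd.Series, p1: pd.Series) -> pd.Series:
        return (p1 - p0).abs()

    flat = flat.withColumn("margin", margin(flat["probability[0]"],
                                            flat["probability[1]"]))
    flat = flat.drop("rawPrediction[0]", "rawPrediction[1]")   # drop unneeded columns
    result = reassemble_vectors(flat)         # re-assembly sketched above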
In the long term, Scala code that projects each vector element into its own column and then applies a flatMap, together with a PySpark wrapper, might be a little cleaner.