
Partial Project UDF optimization #11783

@jinchengchenghh

Description


Currently, in the partial project, we convert the columns to an Arrow ColumnarBatch and access them through ArrowColumnarRow (an InternalRow view). This works well when the number of columns fed into Spark is small.

    // Reuse a single mutable row view over the Arrow vectors and
    // advance its rowId cursor instead of materializing each row.
    val targetRow = new ArrowColumnarRow(vectors)
    for (i <- 0 until numRows) {
      targetRow.rowId = i
      // Project the i-th source row into the i-th slot of the target row view.
      proj.target(targetRow).apply(arrowBatch.getRow(i))
    }
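The cursor pattern above can be sketched in plain Scala (the class and field names below are illustrative, not the actual Gluten API): one mutable row view over column arrays, repositioned via a rowId cursor, avoids allocating a row object per input row.

```scala
// Hypothetical sketch of a cursor-based row view over Int columns only.
class ColumnarRowView(columns: Array[Array[Int]]) {
  var rowId: Int = 0
  def getInt(ordinal: Int): Int = columns(ordinal)(rowId)
}

object RowViewDemo {
  def main(args: Array[String]): Unit = {
    val columns = Array(Array(1, 2, 3), Array(10, 20, 30))
    val view = new ColumnarRowView(columns)
    val sums = (0 until 3).map { i =>
      view.rowId = i // reposition the cursor; no per-row allocation
      view.getInt(0) + view.getInt(1)
    }
    println(sums.mkString(",")) // 11,22,33
  }
}
```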

Possible optimizations:

  1. If the number of columns fed to Spark exceeds a threshold, converting to UnsafeRow would be more memory-friendly. For example, with my_udf(a, b, c, plus(e)) we convert all the columns referenced inside my_udf, so the number of columns can be large.
  2. Converting to an Arrow ColumnarBatch loads the data on-heap, which may trigger GC. We could instead allocate all the memory off-heap to smooth memory usage: convert to the columnar format VeloxColumnarBatch (Java), access the off-heap memory directly like UnsafeRow does, and add a VeloxColumnarRow class analogous to the Arrow one.
  3. The Spark project operator ProjectExec does not support ColumnarBatch, so we convert to a compatible row format, which forfeits the benefits of columnar execution. If ProjectExec supported ColumnarBatch, the UDF execution could process one column after another rather than the current row-to-row approach.
  4. The cost model: the conversion cost currently considers only the number of columns, but conversion consists of two parts: computation cost and memory-copy cost. Copying a string costs more than copying an int. For example, if udf(a) returns a string, then for hash(udf(a)) we should not compute udf(a) in Spark, because that introduces a string fake-ArrowRow row-to-columnar conversion; if Spark could return hash(udf(a)) instead, performance might be better. This conversion cost is amplified for complex data types. We should also measure whether the speedup from executing a hash function natively can exceed the overhead of format conversion; since the Spark project has whole-stage codegen for these functions, this needs further investigation.
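Point 4 could be sketched as a per-type cost model. The weights, type names, and functions below are illustrative assumptions, not Gluten's actual cost model: conversion cost becomes a sum over columns of a fixed computation cost plus a type-dependent copy cost, instead of a flat per-column count.

```scala
// Hypothetical cost model: conversion cost = per-column computation cost
// plus a copy cost that depends on the column's data type.
object ConversionCostModel {
  sealed trait DataType
  case object IntType extends DataType
  case object DoubleType extends DataType
  case class StringType(avgLen: Int) extends DataType

  val computeCostPerColumn = 1.0 // assumed fixed bookkeeping cost

  // Assumed copy-cost weights: fixed-width types are cheap,
  // strings scale with their average length.
  def copyCost(t: DataType): Double = t match {
    case IntType            => 1.0
    case DoubleType         => 2.0
    case StringType(avgLen) => 4.0 + 0.5 * avgLen
  }

  def conversionCost(schema: Seq[DataType], numRows: Int): Double =
    schema.map(t => computeCostPerColumn + copyCost(t)).sum * numRows
}

object CostDemo {
  def main(args: Array[String]): Unit = {
    import ConversionCostModel._
    val intCol    = conversionCost(Seq(IntType), 1000)
    val stringCol = conversionCost(Seq(StringType(32)), 1000)
    // A string column is far costlier to convert than an int column, so
    // returning hash(udf(a)) (fixed-width) beats returning the raw string.
    println(intCol < stringCol) // true
  }
}
```

Under such a model, the planner would compare the native-execution speedup of hash against the string-column conversion cost before deciding where to place hash(udf(a)).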

Gluten version

None


    Labels

    enhancement (New feature or request)
