
Partial Project UDF optimization #11783

@jinchengchenghh

Description


Currently, in the partial project, we convert the columns to an Arrow ColumnarBatch and access them through ArrowColumnarRow (an InternalRow view). This works well when the number of columns fed into Spark is small.

    // Reuse a single mutable row view over the Arrow vectors and
    // advance its rowId cursor instead of materializing each row.
    val targetRow = new ArrowColumnarRow(vectors)
    for (i <- 0 until numRows) {
      targetRow.rowId = i
      // Project the i-th source row into the i-th slot of the target row view.
      proj.target(targetRow).apply(arrowBatch.getRow(i))
    }
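The cursor pattern above can be sketched in plain Scala (the class and field names below are illustrative, not the actual Gluten API): one mutable row view over column arrays, repositioned via a rowId cursor, avoids allocating a row object per input row.

```scala
// Hypothetical sketch of a cursor-based row view over Int columns only.
class ColumnarRowView(columns: Array[Array[Int]]) {
  var rowId: Int = 0
  def getInt(ordinal: Int): Int = columns(ordinal)(rowId)
}

object RowViewDemo {
  def main(args: Array[String]): Unit = {
    val columns = Array(Array(1, 2, 3), Array(10, 20, 30))
    val view = new ColumnarRowView(columns)
    val sums = (0 until 3).map { i =>
      view.rowId = i // reposition the cursor; no per-row allocation
      view.getInt(0) + view.getInt(1)
    }
    println(sums.mkString(",")) // 11,22,33
  }
}
```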

Possible optimizations:

  1. If the number of columns fed to Spark exceeds a threshold, converting to UnsafeRow would be more memory-friendly. For example, with my_udf(a, b, c, plus(e)) we convert all the columns referenced inside my_udf, so the number of columns can be large.
  2. Converting to an Arrow ColumnarBatch loads the data on-heap, which may trigger GC. We could instead allocate all the memory off-heap to smooth memory usage: convert to the columnar format VeloxColumnarBatch (Java), access the off-heap memory directly like UnsafeRow does, and add a VeloxColumnarRow class analogous to the Arrow one.
  3. The Spark project operator ProjectExec does not support ColumnarBatch, so we convert to a compatible row format, which forfeits the benefits of columnar execution. If ProjectExec supported ColumnarBatch, the UDF execution could process one column after another rather than the current row-to-row approach.
  4. The cost model: the conversion cost currently considers only the number of columns, but conversion consists of two parts: computation cost and memory-copy cost. Copying a string costs more than copying an int. For example, if udf(a) returns a string, then for hash(udf(a)) we should not compute udf(a) in Spark, because that introduces a string fake-ArrowRow row-to-columnar conversion; if Spark could return hash(udf(a)) instead, performance might be better. This conversion cost is amplified for complex data types. We should also measure whether the speedup from executing a hash function natively can exceed the overhead of format conversion; since the Spark project has whole-stage codegen for these functions, this needs further investigation.
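Point 4 could be sketched as a per-type cost model. The weights, type names, and functions below are illustrative assumptions, not Gluten's actual cost model: conversion cost becomes a sum over columns of a fixed computation cost plus a type-dependent copy cost, instead of a flat per-column count.

```scala
// Hypothetical cost model: conversion cost = per-column computation cost
// plus a copy cost that depends on the column's data type.
object ConversionCostModel {
  sealed trait DataType
  case object IntType extends DataType
  case object DoubleType extends DataType
  case class StringType(avgLen: Int) extends DataType

  val computeCostPerColumn = 1.0 // assumed fixed bookkeeping cost

  // Assumed copy-cost weights: fixed-width types are cheap,
  // strings scale with their average length.
  def copyCost(t: DataType): Double = t match {
    case IntType            => 1.0
    case DoubleType         => 2.0
    case StringType(avgLen) => 4.0 + 0.5 * avgLen
  }

  def conversionCost(schema: Seq[DataType], numRows: Int): Double =
    schema.map(t => computeCostPerColumn + copyCost(t)).sum * numRows
}

object CostDemo {
  def main(args: Array[String]): Unit = {
    import ConversionCostModel._
    val intCol    = conversionCost(Seq(IntType), 1000)
    val stringCol = conversionCost(Seq(StringType(32)), 1000)
    // A string column is far costlier to convert than an int column, so
    // returning hash(udf(a)) (fixed-width) beats returning the raw string.
    println(intCol < stringCol) // true
  }
}
```

Under such a model, the planner would compare the native-execution speedup of hash against the string-column conversion cost before deciding where to place hash(udf(a)).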

Gluten version

None


    Labels

    enhancement (New feature or request)
