-
Notifications
You must be signed in to change notification settings - Fork 584
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Description
Now in the partial project, we convert the columns to Arrow ColumnarBatch as ArrowColumnarRow(InternalRow), it will take good effect when the number of columns fed into Spark is small.
val targetRow = new ArrowColumnarRow(vectors)
for (i <- 0 until numRows) {
targetRow.rowId = i
proj.target(targetRow).apply(arrowBatch.getRow(i))
}
The possible optimization:
- If the number of columns fed to Spark exceeds a threshold, convert to UnsafeRow will make memory more friendly, this case
my_udf(a, b, c, plus(e)), we will convert all the columns inner of my_udf, so the number of columns might be big. - Convert to Arrow ColumnarBatch loading the data to onheap, which may trigger GC, we try to allocate all the memory to offheap to make the memory more fluent, so we can convert to Columnar format VeloxColumnarBatch(java) and access the offheap memory directly like UnsafeRow, also add the class VeloxColumnarRow as Arrow.
- The Spark project expression ProjectExec does not support ColumnarBatch, so we convert it to a compatible Row format, but this will not benefit from ColumnarExecution, ProjectExec should support ColumnarBatch, then the udf execution will process one column and then another column, rather than current row to row.
- The cost model, the conversion cost now only consider the number of columns conversion cost, but the conversion consists of two part: computation cost and memory copy cost, e.g., the string copy cost is more than int, for example, udf(a) returns string, for hash(udf(a)), we should not compute udf(a) in Spark, it will introduce string fake ArrowRow to columnar conversion, if we can return hash(udf(a)) from Spark, the performance might be better. This conversion cost will amplify in complex data type. And also measure whether the complexity of the
hashfunction that benefits from native execution can exceed the overhead of format conversion. Since Spark project has whole stage codegen for the functions, this may need further investigation.
Gluten version
None
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request